Preface


There have been many important developments in econometrics during the last
two decades, but introductory books in the field still deal mostly with what
econometrics was in the 1960s. The present book is meant to familiarize students
(and researchers) with some of these developments, explaining them with
very simple models without cluttering up the exposition with too much algebraic
detail. Where proofs involve complicated expressions they are omitted
and appropriate references are given. Ten of the more difficult sections have
been marked with an asterisk indicating that they are optional. Beginning students
can skip them and proceed. The book also contains several examples
illustrating the techniques at each stage, and where illustrations are not given,
some data sets have been provided so that students can compute the required
results themselves.

Since the book covers quite a few topics, only a few examples have been 
given to illustrate each point. Giving too many illustrations for a single point 
might be boring for some students and would have made the book much too 
bulky. The instructor’s manual contains more illustrative examples and ques¬ 
tions and answers. The exercises given at the end of each chapter are somewhat 
challenging. However, the instructor’s manual contains answers or guidelines. 
The instructor’s manual also gives a “guided tour” of the material in each chap¬ 
ter as well as some detailed explanations for some points that are touched 
briefly in the book. 

Some of the questions at the end of the chapters have been taken from the 
examinations at several U.S. and U.K. universities, and from P. C. B. Phillips 
and M. R. Wickens, Exercises in Econometrics (Cambridge, Massachusetts, 
Ballinger Publishing Co., 1978), Vol. I. (Many of the questions in that book are 
from examinations in the U.K. universities.) Since questions tend to get re¬ 
peated with minor changes, I have not bothered to quote the exact source for 
each question. 

There are many distinguishing features of the book, some of which are: 

1. A discussion of aims and methodology of econometrics in Chapter 1.

2. A critique of conventional significance levels in Chapter 2. 

3. A discussion of direct versus reverse regression and inverse prediction
in Chapter 3. 

4. A thorough discussion of several practically useful results in multiple
regression (Chapter 4), some of which are not to be found even in specialized
books on regression.

5. Discussion of several new tests for heteroskedasticity, choice of linear 
versus log-linear models, and use of deflators (in Chapter 5). 

6. A thorough discussion of the several limitations of the Durbin-Watson 
test, showing that it is almost useless in practice, and a discussion of 
Sargan’s test, ARCH test, LM test, etc. (Chapter 6). 

7. A critical discussion of the use and limitations of condition numbers,
variance inflation factors (VIF’s), ridge regression, and principal component
regression in the analysis of multicollinearity (Chapter 7). Many
of these techniques are often used uncritically, because computer programs
are readily available.

8. The use of dummy variables to obtain predictions and standard errors 
of predictions, the relationship between discriminant analysis and the 
linear probability model, the logit, probit, and tobit models (Chapter 8). 

9. Inference in under-identified simultaneous equation models, criteria for
normalization, and tests for exogeneity and causality (Chapter 9). 

10. The discussion of partial adjustment models, error correction models, 
rational expectations models, and tests for rationality (Chapter 10). 

11. Reverse regression, proxy variables (Chapter 11). 

12. Different types of residuals and their use in diagnostic checking, model 
selection, choice of F-ratios in the selection of regressors, Hausman’s 
specification error test, tests of nonnested hypotheses (Chapter 12). 

These are not new topics for advanced texts but these topics (most of which 
are never dealt with in introductory texts) are important in econometric prac¬ 
tice. The book explains them all with simple models so that students (and re¬ 
searchers) who have not had exposure to advanced texts and advanced courses 
in econometrics can still learn them and use them. 

I have avoided any special discussion of computer programs. Nowadays, 
there are several computer packages available that one can choose from. Also, 
they change so rapidly that the lists have to be expanded and updated very 
often. The instructor’s manual will provide some guidance on this. I feel that 
it is more important to know the answers to the questions of what to compute, 
why to compute, how to interpret the results, what is wrong with the model, 
and how we can improve the model. Learning how to compute is rather easy. 
In the 1960s this last question received more attention because computer technology
had not progressed enough and not many efficient computer programs
were readily available. With the advances in computer technology and the large
number of computer programs readily available, everyone can easily learn
“how to compute.” That is why I have tried to minimize the discussion of
computer programs or the presentation of computer output. Moreover, there is
no single program that will do all the things discussed in the book. But by
simple adaptation many of the computer programs available will do the job.

Instructors using the book might find it difficult to cover all the material. 
However, it is always better to have some topics in the book that instructors 
can choose from depending on their interests and needs. Chapters 1 through 9 
cover the basic material. The last three chapters, on models of expectations, 
errors in variables, and model selection, are topics that instructors can pick 
and choose from. A one-semester course would cover Chapters 1 to 6 (or 7). 
A two-semester course would cover Chapters 1 to 9 and parts of Chapters 
10-12. In either case Chapter 2 need not be covered but can be used as a ref¬ 
erence. 


Second Edition 

1. Addition of matrix algebra. Many who have used the first edition found
the omission of matrix algebra a handicap. Since the purpose of the book
was to convey the basic ideas without using the “crutches” of matrix
notation, I have decided to retain the original format and add matrix algebra
in the appendixes to Chapters 2, 4, 5, 7, 9, and 12. The appendix
to Chapter 2 gives all the necessary background in matrix algebra and
there are some exercises at the end of the appendix.

2. Exercises on statistical inference. Chapter 2 presents a review of statis¬ 
tical inference. Those who used the book found the review too short, but 
expanding it would have made the book unnecessarily long. However, 
students are expected to have had a beginning statistics course. To make 
the review more useful, a list of questions has been added to Chapter 2. 
Students will get a sufficient review by attempting to solve these ques¬ 
tions. 

3. Addition of chapters on time series, unit roots, and cointegration. One 
major drawback of the first edition has been the complete omission of 
time-series analysis. This has been corrected by adding Chapter 13. In 
addition, the important recent developments of unit roots and cointegra¬ 
tion have been covered in Chapter 14. There is currently no book that 
covers these topics at an introductory level. 

Some instructors might want to cover Chapters 6, 10, 13, and 14 together 
because they are all on time-series topics. There is no need to cover the topics 
in the order in which they appear in the book. 

At some places I have a reference to my earlier book: Econometrics (New 
York, McGraw-Hill, 1977) for details. I saw no point in reproducing some 
proofs or derivations where they were not absolutely necessary for understand¬ 
ing the points being made. At several others, there is a footnote saying that the 
result or proof can be found in many books in econometrics, and I give a ref¬ 
erence to my earlier book with page numbers. However, the same result can 
be often found in other standard references, such as: 

J. Johnston, Econometric Methods (New York, McGraw-Hill), 3rd ed.,
1984. 

J. Kmenta, Elements of Econometrics (New York, The Macmillan Co.), 2nd 
ed., 1986. 

H. Theil, Principles of Econometrics (New York, Wiley), 1971. 

G. G. Judge, C. R. Hill, W. E. Griffiths, H. Lütkepohl, and T. C. Lee, The
Theory and Practice of Econometrics (New York, Wiley), 2nd ed., 1985.

E. Malinvaud, Statistical Methods of Econometrics (Amsterdam, North
Holland), 3rd ed., 1976.

Since I did not find it practicable to give detailed page numbers for every 
book, I have just referred to my earlier book. 

I would like to thank Jack Repcheck for first encouraging me to write this 
book and Ken MacLeod and John Travis for bringing the first edition to com¬ 
pletion. I would also like to thank Caroline Carney for initiating this second 
edition and Jill Lectka and David Boelio for their encouragement in bringing it 
to completion. 

I would like to thank Richard Butler, Melanie Courchene, Jinook Jeong, Fred 
Joutz, Sang-Heung Kang, In-Moo Kim, Jongmin Lee, Wanki Moon, Marc Nerlove,
Mark Rush, W. Douglas Shaw, Robert Trost, Ken White, Jisoo Yoo, and
many others for their comments and suggestions. 

I would like to thank Kajal Lahiri, Scott Shonkwiler, and Robert R Trost for 
their help on the first edition. 

I would like to thank In-Moo Kim for his invaluable help in preparing this 
second edition. Without his assistance, this edition would have taken much 
longer to complete. I would also like to thank Ann Crisafulli for typing several 
versions of the additions and corrections cheerfully and efficiently. Finally, I 
would like to thank all those who used the book and also passed on helpful 
comments. 


G. S. Maddala 



What Is Econometrics? 




1.1 What Is Econometrics? 

1.2 Economic and Econometric Models 

1.3 The Aims and Methodology of Econometrics 

1.4 What Constitutes a Test of an Economic Theory? 
Summary and an Outline of the Book 


1.1 What Is Econometrics? 


Literally speaking, the word “econometrics” means “measurement in economics.”
This is too broad a definition to be of any use because most of economics
is concerned with measurement. We measure our gross national product, employment,
money supply, exports, imports, price indexes, and so on. What we
mean by econometrics is:

The application of statistical and mathematical methods to the analysis of
economic data, with a purpose of giving empirical content to economic theories
and verifying them or refuting them.

In this respect econometrics is distinguished from mathematical economics, 
which consists of the application of mathematics only, and the theories derived 
need not necessarily have an empirical content. 


The application of statistical tools to economic data has a very long history. 
Stigler 1 notes that the first “empirical” demand schedule was published in 1699 
by Charles Davenant and that the first modern statistical demand studies were 
made by Rodolfo Benini, an Italian statistician, in 1907. The main impetus to
the development of econometrics, however, came with the establishment of the 
Econometric Society in 1930 and the publication of the journal Econometrica 
in January 1933. 

Before any statistical analysis with economic data can be done, one needs a 
clear mathematical formulation of the relevant economic theory. To take a very 
simple example, saying that the demand curve is downward sloping is not 
enough. We have to write the statement in mathematical form. This can be 
done in several ways. For instance, defining q as the quantity demanded and p 
as price, we can write 

$$q = \alpha + \beta p, \qquad \beta < 0$$

or

$$q = A p^{\beta}, \qquad \beta < 0$$

As we will see later in the book, one of the major problems we face is the 
fact that economic theory is rarely informative about functional forms. We 
have to use statistical methods to choose the functional form, as well. 
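
As a brief illustration of why the choice of functional form matters, the second (constant-elasticity) form can be rewritten by taking logarithms. This is a standard algebraic step rather than something stated in the text at this point; the symbols are those of the two demand equations above:

```latex
\log q = \log A + \beta \log p, \qquad \beta < 0
```

In the linear form, $\beta$ is the slope $dq/dp$, the change in quantity per unit change in price. In the logarithmic form, $\beta$ is the price elasticity $d\log q / d\log p$. The same symbol therefore answers two different questions, which is one reason the functional form has to be chosen with care.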

1.2 Economic and Econometric Models 


The first task an econometrician faces is that of formulating an econometric 
model. What is a model? 

A model is a simplified representation of a real-world process. For instance, 
saying that the quantity demanded of oranges depends on the price of oranges 
is a simplified representation because there are a host of other variables that 
one can think of that determine the demand for oranges. For instance, income 
of consumers, an increase in diet consciousness (“drinking coffee causes cancer,
so you better switch to orange juice,” etc.), an increase or decrease in the
price of apples, and so on. However, there is no end to this stream of other 
variables. In a remote sense even the price of gasoline can affect the demand 
for oranges. 

Many scientists have argued in favor of simplicity because simple models are 
easier to understand, communicate, and test empirically with data. This is the 
position of Karl Popper 2 and Milton Friedman. 3 The choice of a simple model 
to explain complex real-world phenomena leads to two criticisms: 

1 G. J. Stigler, “The Early History of Empirical Studies of Consumer Behavior,” The Journal
of Political Economy, 1954 [reprinted in G. J. Stigler, Essays in the History of Economics
(Chicago: University of Chicago Press, 1965)].

2 K. R. Popper, The Logic of Scientific Discovery (London: Hutchinson, 1959), p. 142.

3 M. Friedman, “The Methodology of Positive Economics,” in Essays in Positive Economics
(Chicago: University of Chicago Press, 1953), p. 14.





1. The model is oversimplified. 

2. The assumptions are unrealistic. 

For instance, in our example of the demand for oranges, to say that it depends
on only the price of oranges is an oversimplification and also an unrealistic
assumption. To the criticism of oversimplification, one can argue that it is
better to start with a simplified model and progressively construct more complicated
models. This is the idea expressed by Koopmans. 4 On the other hand,
there are some who argue in favor of starting with a very general model and
simplifying it progressively based on the data available. 5 The famous statistician
L. J. (Jimmy) Savage used to say that “a model should be as big as an
elephant.” Whatever the relative merits of this alternative approach are, we
will start with simple models and progressively build more complicated models.

The other criticism we have mentioned is that of “unrealistic assumptions.” 
To this criticism Friedman argued that the assumptions of a theory are never 
descriptively realistic. He says: 6 

The relevant question to ask about the “assumptions” of a theory is not 
whether they are descriptively “realistic” for they never are, but whether 
they are sufficiently good approximations for the purpose at hand. And this 
question can be answered by only seeing whether the theory works, which 
means whether it yields sufficiently accurate predictions. 

Returning to our example of demand for oranges, to say that it depends only 
on the price of oranges is a descriptively unrealistic assumption. However, the 
inclusion of other variables, such as income and price of apples in the model, 
does not render the model more descriptively realistic. Even this model can be 
considered to be based on unrealistic assumptions because it leaves out many 
other variables (like health consciousness, etc.). But the issue is which model 
is more useful for predicting the demand for oranges. This issue can be decided 
only from the data we have and the data we can get. 

In practice, we include in our model all the variables that we think are relevant
for our purpose and dump the rest of the variables in a basket called
“disturbance.” This brings us to the distinction between an economic model 
and an econometric model. 

An economic model is a set of assumptions that approximately describes the 
behavior of an economy (or a sector of an economy). An econometric model 
consists of the following: 

1. A set of behavioral equations derived from the economic model. These 
equations involve some observed variables and some “disturbances” 
(which are a catchall for all the variables considered as irrelevant for the 
purpose of this model as well as all unforeseen events). 


4 T. C. Koopmans, Three Essays on the State of Economic Science (New York: McGraw-Hill,
1957), pp. 142-143.

5 This is the approach suggested by J. D. Sargan and notably David F. Hendry.

6 Friedman, “Methodology,” pp. 14-15.


2. A statement of whether there are errors of observation in the observed 
variables. 

3. A specification of the probability distribution of the “disturbances” (and 
errors of measurement). 

With these specifications we can proceed to test the empirical validity of the 
economic model and use it to make forecasts or use it in policy analysis. 

Taking the simplest example of a demand model, the econometric model usually
consists of:

1. The behavioral equation

$$q = \alpha + \beta p + u$$

where q is the quantity demanded and p the price. Here p and q are the
observed variables and u is a disturbance term.

2. A specification of the probability distribution of u which says that
$E(u \mid p) = 0$ and that the values of u for the different observations are
independently and normally distributed with mean zero and variance $\sigma^2$.

With these specifications one proceeds to test empirically the law of demand
or the hypothesis that $\beta < 0$. One can also use the estimated demand function
for prediction and policy purposes.
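
The following is a minimal sketch of what such a model looks like in practice, under stated assumptions. It simulates data from a linear demand equation with normally distributed disturbances and then estimates the parameters by least squares, a method the book only takes up in Chapter 3. The parameter values (10, -2, 1), the sample size, the price range, and the function name `fit_demand` are all hypothetical choices made for illustration.

```python
import math
import random


def fit_demand(p, q):
    """Least-squares fit of q = alpha + beta * p + u.

    Returns (alpha_hat, beta_hat, se_beta), where se_beta is the usual
    standard error of the slope estimate.
    """
    n = len(p)
    p_bar = sum(p) / n
    q_bar = sum(q) / n
    s_pp = sum((pi - p_bar) ** 2 for pi in p)
    s_pq = sum((pi - p_bar) * (qi - q_bar) for pi, qi in zip(p, q))
    beta_hat = s_pq / s_pp
    alpha_hat = q_bar - beta_hat * p_bar
    # Residual sum of squares, divided by n - 2 degrees of freedom.
    rss = sum((qi - alpha_hat - beta_hat * pi) ** 2 for pi, qi in zip(p, q))
    sigma2_hat = rss / (n - 2)
    se_beta = math.sqrt(sigma2_hat / s_pp)
    return alpha_hat, beta_hat, se_beta


# Assumed "true" model, for illustration only: q = 10 - 2 p + u, u ~ N(0, 1).
ALPHA, BETA, SIGMA = 10.0, -2.0, 1.0
rng = random.Random(0)
prices = [rng.uniform(1.0, 4.0) for _ in range(200)]
quantities = [ALPHA + BETA * pi + rng.gauss(0.0, SIGMA) for pi in prices]

alpha_hat, beta_hat, se_beta = fit_demand(prices, quantities)
t_stat = beta_hat / se_beta  # statistic for H0: beta = 0 against beta < 0

print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
print(f"standard error of beta_hat = {se_beta:.3f}, t statistic = {t_stat:.2f}")
```

A large negative t statistic is evidence in favor of a downward-sloping demand curve. The inference rests on the assumed distribution of u, which is why that specification is listed as part of the econometric model.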


1.3 The Aims and Methodology 
of Econometrics 


The aims of econometrics are: 

1. Formulation of econometric models, that is, formulation of economic 
models in an empirically testable form. Usually, there are several ways 
of formulating the econometric model from an economic model because 
we have to choose the functional form, the specification of the stochastic 
structure of the variables, and so on. This part constitutes the specifica¬ 
tion aspect of the econometric work. 

2. Estimation and testing of these models with observed data. This part con¬ 
stitutes the inference aspect of the econometric work. 

3. Use of these models for prediction and policy purposes. 

During the 1950s and 1960s the inference aspect received a lot of attention
and the specification aspect very little. The major preoccupation of econometricians
had been the statistical estimation of correctly specified econometric
models. During the late 1940s the Cowles Foundation provided a major breakthrough
in this respect, but the statistical analysis presented formidable computational
problems. Thus the 1950s and 1960s were spent mostly in devising
alternative estimation methods and alternative computer algorithms. Not much
attention was paid to errors in the specification or to errors in observations. 7
With the advent of high-speed computers, all this has, however, changed. The
estimation problems are no longer formidable and econometricians have turned
attention to other aspects of econometric analysis.

Figure 1.1. Schematic description of the steps involved in an econometric analysis of
economic models.

We can schematically depict the various steps involved in an econometric 
analysis, as was done before the emphasis on specification analysis. This is 
shown in Figure 1.1. Since the entries in the boxes are self-explanatory, we will 
not elaborate on them. The only box that needs an explanation is box 4, “prior 
information.” This refers to any information that we might have on the un¬ 
known parameters in the model. This information can come from economic 
theory or from previous empirical studies. 

7 There was some work on specification errors in the early 1960s by Theil and Griliches, but this
referred to omitted-variable bias (see Chapter 4). Griliches’ lecture notes (unpublished) at the
University of Chicago were titled “Specification Errors in Econometrics.”

There has, however, been considerable dissatisfaction with the scheme
shown in Figure 1.1. Although one can find instances of dissatisfaction earlier,
it was primarily during the 1970s that arguments were leveled against the one-way
traffic shown in Figure 1.1. We will discuss three of these arguments.

1. In Figure 1.1 there is no feedback from the econometric testing of economic
theories to the formulation of economic theories (i.e., from box 6
to box 1). It has been argued that econometricians are not just handmaidens
of economic theorists. It is not true that they just take the theories
they are given and test them, learning nothing from the tests. So we
need an arrow from box 6 to box 1.

2. The same is true regarding the data collection agencies. It is not true that 
they just gather whatever data they can and the econometricians use 
whatever data are given them. (The word data comes from the Latin word 
datum, which means given.) There ought to be feedback from boxes 2 
and 5 to box 3. 

3. Regarding box 6 itself, it has been argued that the hypothesis testing refers
only to the hypotheses suggested by the original economic model.
This depends on the assumption that the specification adopted in box 2
is correct. However, what we should be doing is testing the adequacy of
the original specification as well. Thus we need an extra box of specification
testing and diagnostic checking. There will also be feedback from
this box to box 2, that is, the specification tests will result in a new specification
for the econometric models. These issues, which represent
one of the most important developments in econometrics in the
1970s, are treated in Chapter 12.

The new developments that have been suggested are shown figuratively in 
Figure 1.2. Some of the original boxes have been deleted or condensed. The 
schematic description in Figure 1.2 is illustrative only and should not be interpreted
literally. The important things to note are the feedback:

1. From econometric results to economic theory. 

2. From specification testing and diagnostic checking to revised specification
of the economic model.

3. From the econometric model to data. 

In the foregoing scheme we have talked of only one theory, but often there 
are many competing theories, and one of the main purposes of econometrics is 
to help in the choice among competing theories. This is the problem of model 
selection, which is discussed in Chapter 12. 

1.4 What Constitutes a Test 
of an Economic Theory? 


Earlier, we stated that one of the aims of econometrics was that of testing 
economic theories. An important question that arises is: What constitutes a 
test? As evidence of a successful test of economic theory, it is customary to
report that the signs of the estimated coefficients in an econometric model are
correct. This approach may be termed the approach of confirming economic
theories. The problem with this approach is that, as Mark Blaug points out: 8

In many areas of economics, different econometric studies reach conflicting
conclusions and, given the available data, there are frequently no effective
methods for deciding which conclusion is correct. In consequence, contradictory
hypotheses continue to co-exist sometimes for decades or more.

Figure 1.2. Revised schematic description of the steps involved in an econometric analysis
of economic models.

8 M. Blaug, The Methodology of Economics (Cambridge: Cambridge University Press, 1980),
p. 261.

A more valid test of an economic theory is if it can give predictions that are
better than those of alternative theories suggested earlier. Thus one needs
to compare a given model with earlier models. This approach of evaluating
alternative theories has received greater attention in recent years. The problems
associated with model comparison and model selection are dealt with in
Chapter 12.
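
One simple way to compare models by their predictions is sketched below, under stated assumptions. Each model is fitted to one part of the data, and the two are then compared on how well they predict observations that were held back. The data-generating process (a constant-elasticity demand curve with multiplicative errors), the parameter values, the 200/100 split, and the helper names `ols` and `rmse` are hypothetical and chosen only for illustration. The fitting method is least squares, which is introduced in Chapter 3.

```python
import math
import random


def ols(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def rmse(actual, predicted):
    """Root-mean-square prediction error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, predicted))
                     / len(actual))


# Assumed data-generating process (illustration only): q = A * p**beta * exp(u).
A, BETA = 20.0, -1.5
rng = random.Random(1)
prices = [rng.uniform(1.0, 5.0) for _ in range(300)]
quantities = [A * p ** BETA * math.exp(rng.gauss(0.0, 0.1)) for p in prices]

# Fit on the first 200 observations, predict the remaining 100.
train_p, test_p = prices[:200], prices[200:]
train_q, test_q = quantities[:200], quantities[200:]

# Model 1: linear demand, q = a + b * p.
a_lin, b_lin = ols(train_p, train_q)
pred_lin = [a_lin + b_lin * p for p in test_p]

# Model 2: log-linear demand, log q = a + b * log p.
a_log, b_log = ols([math.log(p) for p in train_p],
                   [math.log(q) for q in train_q])
# Naive back-transformation; it ignores the retransformation bias correction.
pred_log = [math.exp(a_log + b_log * math.log(p)) for p in test_p]

rmse_lin = rmse(test_q, pred_lin)
rmse_log = rmse(test_q, pred_log)
print(f"out-of-sample RMSE, linear model:     {rmse_lin:.3f}")
print(f"out-of-sample RMSE, log-linear model: {rmse_log:.3f}")
```

Both models here yield a negative price coefficient, so checking signs alone would not separate them. The comparison of prediction errors does.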


Summary and an Outline of the Book 


The preceding discussion suggests some changes that have taken place during
the last decade in the development of econometric methods. The earlier emphasis
(during the 1950s and 1960s) was on efficient estimation of a given
model. The emphasis has now shifted to specification testing, diagnostic checking,
model comparison, and an adequate formulation of the expectational variables
given the pervasive role of expectations in almost all economic theories.

A large part of the book, however, will be devoted to developments that took 
place in the 1950s and 1960s because these developments form the basis for 
recent work. However, Chapter 12 is devoted to recent developments in the
areas of specification testing, diagnostic checking, and model selection. In addition,
the other chapters also describe recent developments. For instance, rational
expectations models are discussed in Chapter 10, recent developments
on tests for serial correlation are reviewed in Chapter 6, and vector autoregressions,
unit roots, and cointegration are discussed in Chapter 14.

The plan of the book is as follows: 

Chapter 2 contains a review of some basic results in statistics. Most students 
will have covered this material in introductory courses in statistics. 

Chapter 3 begins with the simple (two-variable) regression model. Some of 
the more complicated sections, such as those on inverse prediction, stochastic 
regressors, and regression fallacy, are denoted as optional and are marked with 
an asterisk. Beginning students may want to skip these. 

In Chapter 4 we discuss the multiple regression model in great detail. It is 
the longest chapter in the book because it covers several issues concerning the 
interpretation of multiple regression coefficients. A detailed discussion of the 
model with two explanatory variables is given to fix the ideas. 

Chapters 5 and 6 deal, respectively, with the problems of unequal error variances
and serially correlated errors in the multiple regression model. Chapter
6, in particular, deals extensively with the usual approach of using the Durbin-Watson
test and discusses several situations where its use would be undesirable.

Chapter 7 is devoted to the multicollinearity problem. A separate chapter on 
this problem is perhaps not necessary, but it was found that inclusion of this 
material in any other chapter would have made that chapter far too long. 

Chapter 8 deals with dummy variables that are used as explanatory variables 
in several contexts. We also discuss special problems created when the dependent
variable is a dummy variable or a truncated variable. There is a discussion
of logit, probit, and tobit models.

In Chapter 9 we give an elementary introduction to simultaneous-equations
models and to the identification problem in econometrics. 

In Chapter 10 we consider distributed lag models and models of expectations.
We also give an elementary introduction to rational expectations models.
Models with lags in behavior and expectations play an important role in most 
econometric models. 

In Chapter 11 we consider errors in variables and proxy variables. 

In Chapter 12 we present an introduction to some recent work on diagnostic 
checking, specification testing, and model selection. 

In Chapter 13 we present an introduction to time-series analysis. 

In Chapter 14 we discuss recent developments in time-series analysis. 

Throughout, an attempt has been made to explain complicated material in 
simple terms. To avoid cumbersome algebra, many results have been stated 
without proofs although at times algebraic detail is unavoidable. Given readily 
available computer packages, even this algebraic detail may be unnecessary, 
but it is good for students to understand at least some minimal steps involved 
in the construction of computer packages. This will help in understanding the 
deficiencies and limitations of the various procedures.


2 STATISTICAL BACKGROUND AND MATRIX ALGEBRA


2.1 Introduction 


In this chapter we review some basic results in probability and statistics that
are used in the book. Most of the results are generally covered in a course in
introductory statistics that students take before entering a course in econometrics.
Hence the discussion is concise, and proofs are omitted. However, exercises
at the end of the chapter will provide some practice in the use of concepts
and results presented here. Students should attempt to work on these questions
as well, as part of this review. This chapter serves two purposes:

1. As a review of some material covered in previous courses in statistics. 

2. As a reference for some results when needed. 

In this chapter we omit discussion of methods of estimation: method of moments,
method of least squares, and the method of maximum likelihood as they
are discussed in Chapter 3 when we cover simple regression.

The other thing this chapter provides is an introduction to matrix algebra. 
This is included in an appendix because the main body of the text does not 
involve matrix notation and explains the basic concepts with simple models. 
Chapters 4, 5, 7, 9, and 12 also contain appendixes that contain proofs of some 
results stated without proof in the respective chapters. These proofs need the 
matrix notation. Thus students wishing to learn the proofs and the matrix notation
would start with this appendix and then go through the appendixes to
the other chapters. The appendix to this chapter also contains exercises on 
matrix algebra that students are advised to work on. 

2.2 Probability 


The term probability is used to give a quantitative measure to the uncertainty 
associated with our statements. The earliest notions of probability were first 
applied to gambling. Backgammon was known thousands of years ago. Although
most of the gamblers presumably calculated some probabilities, it is
Gerolamo Cardano (1501-1576), an Italian physician, mathematician, and astrologer,
who is credited with the first systematic computation of probabilities.
He defined probability as

$$\text{probability} = \frac{\text{number of favorable outcomes}}{\text{total number of possible outcomes}}$$

This is called the classical view of probability. Suppose that we roll a balanced
die. What is the probability of getting a 6? The number of favorable outcomes
is one since there is only one 6. The total number of possible outcomes is six
(1, 2, 3, 4, 5, 6). Hence the probability is 1/6.

Another definition of probability is that of the relative frequency of the occurrence
of an event in a large number of repetitions. In the case of the die, we
calculate the probability of a 6 by rolling the die a large number of times and
then observing the proportion of times we get a 6. This observed proportion
will give us the probability of a 6. You would think that nobody would have the
patience to compute probabilities this way. However, there is a well-known
example of Kerrich, who, while interned in Denmark during World War II, performed
several such experiments. For instance, he tossed a coin 10,000 times!
Initially, the relative frequency of heads fluctuated widely, but it finally settled
close to 0.5, with a value of 0.507 on the final toss. The experiments are described
in J. E. Kerrich, An Experimental Introduction to the Theory of Probability
(Copenhagen: Jorgensen, 1946).
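The relative-frequency idea can be imitated on a computer. The sketch below is only an illustration: it replaces the physical coin by Python's pseudo-random generator, and the function name `running_relative_frequency` and the seed are assumptions made for this example, not part of the experiments described above.

```python
import random

def running_relative_frequency(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin; return the running share of heads."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    heads = 0
    shares = []
    for toss in range(1, n_tosses + 1):
        heads += rng.random() < 0.5  # True counts as one head
        shares.append(heads / toss)
    return shares

shares = running_relative_frequency(10_000)
for n in (10, 100, 1_000, 10_000):
    # Early shares fluctuate; later ones settle near 0.5.
    print(n, round(shares[n - 1], 3))
```

The early values typically wander, while the share after 10,000 simulated tosses is usually close to 0.5, which is the pattern described in the text.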

The frequency view of probability, first given by the French mathematician
Poisson in 1837, says that if n is the number of trials and n(E) the number of
occurrences of the event E, then the probability of E, denoted by P(E), is

$$ P(E) = \lim_{n \to \infty} \frac{n(E)}{n} $$

According to the classical view, probability is a theoretical number defined as
the ratio of the number of favorable cases to the total number of possible cases.
According to the frequency view, it is a limit of the observed relative frequency
as the number of repetitions becomes very large.

A third view is the subjective view of probability, which is based on personal
beliefs. Suppose you say that the probability that the New York Giants will win
is 3/4. Consider a bet where you are given $1 if the Giants win and you pay $3 if
they lose. Your probability indicates that this is a "fair" bet, since the expected
gain, (3/4)(1) - (1/4)(3), is zero. If you are not
willing to take this bet, your subjective probability that the Giants will win is
< 3/4. If you are more than willing to take this bet, your subjective probability that the
Giants will win is > 3/4. Because of the "betting" basis for subjective probability,
it is also called "personal pignic probability." The word "pignic" comes from
the Latin word pignus (a bet).

Addition Rules of Probability 

Fortunately, however, whatever view of probability we adopt, the rules for the
calculation of probability are the same. Before we discuss these, we have to
define some terms. Events $A_1, A_2, A_3, \ldots$ are said to be mutually exclusive if
when one occurs, the others do not. They are said to be exhaustive if they
exhaust all the possibilities. In the case of a die, the events $A_1, A_2, \ldots, A_6$
that the die shows 1, 2, 3, 4, 5, and 6 are mutually exclusive as well as exhaustive.

We write P(A + B) for the probability that at least one of the events A and B
occurs. This is called the union of the two events. We write P(AB) for the probability
of the joint occurrence of A and B. This is called the intersection of the
events A and B.

For example, if we define A and B as:

A: The die shows 1, 3, or 5
B: The die shows 3

then

A + B: The die shows 1, 3, or 5
AB: The die shows 3

The addition rule of probability states that 

P(A + B) = P(A) + P(B) - P(AB)

(We can show this by drawing a diagram known as the Venn diagram. This is 
left as an exercise.) If A and B are mutually exclusive, they cannot occur 
jointly, so P(AB) = 0. Thus for mutually exclusive events we have 

P(A + B) = P(A) + P(B) 

If, in addition, A and B are exhaustive, P(A) + P(B) = 1. 

We denote by $\bar{A}$ the complement of A. $\bar{A}$ represents the nonoccurrence of A.
Since either A occurs or does not (i.e., $\bar{A}$ occurs), A and $\bar{A}$ are mutually exclusive
and exhaustive. Hence $P(A) + P(\bar{A}) = 1$ or $P(\bar{A}) = 1 - P(A)$. We can
also write $P(A) = P(AB) + P(A\bar{B})$ because A can occur jointly with B or without
B.
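The addition rule and the complement rule can be checked by direct counting on the die example. The following is a minimal sketch under the classical definition of probability; the helper `prob` and the set names are introduced here purely for illustration.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # the six faces of a balanced die

def prob(event):
    """Classical probability: favorable outcomes over possible outcomes."""
    return Fraction(len(event & OUTCOMES), len(OUTCOMES))

A = frozenset({1, 3, 5})  # the die shows 1, 3, or 5
B = frozenset({3})        # the die shows 3

union = A | B             # "A + B" in the notation of the text
intersection = A & B      # "AB" in the notation of the text
not_A = OUTCOMES - A      # complement of A
not_B = OUTCOMES - B      # complement of B

# Addition rule: P(A + B) = P(A) + P(B) - P(AB)
addition_rule_holds = prob(union) == prob(A) + prob(B) - prob(intersection)
# Complement rule: P(A) + P(not A) = 1
complement_rule_holds = prob(A) + prob(not_A) == 1
# Decomposition: P(A) = P(AB) + P(A and not B)
decomposition_holds = prob(A) == prob(A & B) + prob(A & not_B)

print(prob(union), prob(intersection))
print(addition_rule_holds, complement_rule_holds, decomposition_holds)
```

Exact fractions are used instead of floating-point numbers so that the equalities can be tested without rounding error.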


Conditional Probability and the Multiplication Rule 


Sometimes we restrict our attention to a subset of all possible events. For instance,
suppose that when we throw a die, the cases 1, 2, and 3 do not count.
Thus the restricted set of events is that the die shows 4, 5, or 6. There are three
possible outcomes. Consider the event A that the die shows a 6. The probability
of A is now 1/3 since the total number of outcomes is 3, not 6. Conditional probability
is defined as follows: The probability of an event A, given that another
event B has occurred, is denoted by P(A|B) and is defined by

$$ P(A|B) = \frac{P(AB)}{P(B)} $$

In the case above, P(AB) = 1/6, P(B) = 1/2, and hence P(A|B) = 1/3. We shall now
define independent events. A and B are said to be independent if the probability
of the occurrence of one does not depend on whether or not the other has
occurred. Thus if A and B are independent, the conditional and unconditional
probabilities are the same, that is, P(A|B) = P(A) and P(B|A) = P(B). Since
P(A|B) = P(AB)/P(B) and P(A|B) = P(A), we get the multiplication rule, which
says that

P(AB) = P(A) · P(B)    if A and B are independent

As an example, suppose that we throw a die two times:

A = event that the first throw shows a 6
B = event that the second throw shows a 6

Clearly, A and B are independent events. Hence Prob(we get a double 6) =
P(AB) = P(A) · P(B) = 1/36.
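Both die examples can be reproduced by enumerating equally likely outcomes. The sketch below is illustrative only; the helpers `prob` and `cond_prob` and the event functions are names assumed for this example.

```python
from fractions import Fraction
from itertools import product

def prob(event, space):
    """Share of equally likely outcomes in `space` for which `event` is true."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

def cond_prob(a, b, space):
    """Conditional probability P(A|B) = P(AB) / P(B)."""
    return prob(lambda o: a(o) and b(o), space) / prob(b, space)

# One throw, where only 4, 5, and 6 "count": B = {4, 5, 6}, A = {6}.
one_throw = list(range(1, 7))
p_six_given_high = cond_prob(lambda o: o == 6, lambda o: o >= 4, one_throw)

# Two throws: all 36 ordered pairs are equally likely.
two_throws = list(product(range(1, 7), repeat=2))

def first_six(o):
    return o[0] == 6

def second_six(o):
    return o[1] == 6

p_double_six = prob(lambda o: first_six(o) and second_six(o), two_throws)
# Independence: conditioning on the second throw leaves P(first is 6) unchanged.
independent = cond_prob(first_six, second_six, two_throws) == prob(first_six, two_throws)

print(p_six_given_high, p_double_six, independent)
```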
Bayes’ Theorem


Bayes’ theorem is based on conditional probability. We have

$$ P(A|B) = \frac{P(AB)}{P(B)} \quad \text{and} \quad P(B|A) = \frac{P(AB)}{P(A)} $$

Write the second equation as

$$ P(AB) = P(B|A) \cdot P(A) $$

Then we get

$$ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} $$

This is known as Bayes’ theorem. It appeared in a text published in 1763 by
Reverend Thomas Bayes, a part-time mathematician.

Let $H_1$ and $H_2$ denote two hypotheses and D denote the observed data. Let
us substitute $H_1$ and $H_2$ in turn for A and substitute D for B. Then we get

$$ P(H_1|D) = \frac{P(D|H_1) \cdot P(H_1)}{P(D)} $$

and

$$ P(H_2|D) = \frac{P(D|H_2) \cdot P(H_2)}{P(D)} $$

Hence we get

$$ \frac{P(H_1|D)}{P(H_2|D)} = \frac{P(D|H_1)}{P(D|H_2)} \cdot \frac{P(H_1)}{P(H_2)} $$

The left-hand side of this equation is called posterior odds. The first term on
the right-hand side is called the likelihood ratio, and the second term is called
the prior odds. We shall make use of this in Chapter 12 for the problem of
choice between two models. For the present let us consider the following example.

We have two urns: the first has 1 red ball and 4 white balls, and the second
has 2 red balls and 2 white balls. An urn is chosen at random and a ball drawn.
The ball is white. What is the probability that it came from the first urn? Let
us define:

$H_1$: The first urn was chosen
$H_2$: The second urn was chosen
D: Data, that is, the ball is white

We have $P(H_1) = P(H_2) = 1/2$. Also, $P(D|H_1) = 4/5$ and $P(D|H_2) = 1/2$. Hence
$P(H_1|D)/P(H_2|D) = 8/5$ or $P(H_1|D) = 8/13$ and $P(H_2|D) = 5/13$. The required
probability is 8/13.
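The urn calculation can be written out as a small routine. This is a sketch, not a general Bayesian tool: the function `posterior` is a name assumed here, and it computes P(D) by summing P(D|H) P(H) over the hypotheses, which is valid because the two urns are mutually exclusive and exhaustive.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(H_i | D) for each hypothesis from P(H_i) and P(D | H_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(D | H_i) * P(H_i)
    p_data = sum(joint)                                   # P(D)
    return [j / p_data for j in joint]

priors = [Fraction(1, 2), Fraction(1, 2)]       # each urn equally likely
likelihoods = [Fraction(4, 5), Fraction(2, 4)]  # P(white | urn 1), P(white | urn 2)

post = posterior(priors, likelihoods)
posterior_odds = post[0] / post[1]
likelihood_ratio = likelihoods[0] / likelihoods[1]
prior_odds = priors[0] / priors[1]

print(post[0], post[1], posterior_odds)
```

With equal prior probabilities the prior odds are 1, so the posterior odds equal the likelihood ratio.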


Summation and Product Operations 

In the examples we considered, we often had only two events. If we have n
events, we have to use the summation operator $\sum$ and the product operator $\prod$.
Since we shall be using these in other contexts as well, we shall discuss them
here.

The summation operator $\sum$ is defined as follows:

$$ \sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n $$

Some important properties of this operator are:

1. $\sum_{i=1}^{n} c = nc$, where c is a constant.

2. $\sum_{i=1}^{n} (X_i + Y_i) = \sum_{i=1}^{n} X_i + \sum_{i=1}^{n} Y_i$.

3. $\sum_{i=1}^{n} (c + bX_i) = nc + b \sum_{i=1}^{n} X_i$.

Where there is no confusion we will just write $\sum_i X_i$ or $\sum X_i$ to denote
$\sum_{i=1}^{n} X_i$, that is, summation over all the X's.

Sometimes we use the double-summation operator, which is defined as follows:

$$ \sum_{i=1}^{n} \sum_{j=1}^{m} X_{ij} = X_{11} + X_{12} + \cdots + X_{1m} + X_{21} + X_{22} + \cdots + X_{2m} + \cdots + X_{n1} + X_{n2} + \cdots + X_{nm} $$

Again, where there is no confusion that i runs from 1 to n and j runs from 1 to
m, we will just write $\sum_i \sum_j X_{ij}$. Yet another notation is $\sum_i \sum_{j \ne i} X_{ij}$, which denotes
summation over all values of i and j except those for which i = j. For
instance, suppose that i goes from 1 to 3 and j goes from 1 to 2. Then

$$ \sum_i \sum_j X_{ij} = X_{11} + X_{12} + X_{21} + X_{22} + X_{31} + X_{32} $$

But

$$ \sum_i \sum_{j \ne i} X_{ij} = X_{12} + X_{21} + X_{31} + X_{32} $$

That is, we have to omit all the terms for which i = j. As yet another example,
consider

$$ (X_1 + X_2 + X_3)^2 = X_1^2 + X_2^2 + X_3^2 + 2X_1X_2 + 2X_2X_3 + 2X_1X_3 $$

We can write this as

$$ \left(\sum_i X_i\right)^2 = \sum_i X_i^2 + \sum_i \sum_{j \ne i} X_i X_j $$
The product operator $\prod$ is defined as

$$ \prod_{i=1}^{n} X_i = X_1 X_2 \cdots X_n $$

Again, where there is no confusion as to the limits of i, we will just write
$\prod_i X_i$ or $\prod X_i$. As with the $\sum$ operator, we can also use the double-product
operator, $\prod \prod$. For instance, if we have variables $X_1, X_2, X_3$ and $Y_1, Y_2$, then

$$ \prod_i X_i \prod_j Y_j = X_1 X_2 X_3 Y_1 Y_2 $$
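The summation properties, the rule of omitting the terms with i = j, and the product operator translate directly into loops. The following sketch uses small integer lists chosen only for illustration; the names `X`, `Y`, `c`, `b`, and `square_of_sum_split` are assumptions of this example.

```python
from math import prod

X = [2, 3, 5]
Y = [1, 4, 6]
c, b = 7, 3
n = len(X)

# Property 1: summing a constant n times gives n * c.
prop1 = sum(c for _ in range(n)) == n * c
# Property 2: the sum of (X_i + Y_i) splits into two sums.
prop2 = sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)
# Property 3: the sum of (c + b * X_i) equals n * c + b * sum of X_i.
prop3 = sum(c + b * x for x in X) == n * c + b * sum(X)

def square_of_sum_split(values):
    """Split (sum of values)^2 into squared terms and cross terms with i != j."""
    squares = sum(v * v for v in values)
    cross = sum(
        values[i] * values[j]
        for i in range(len(values))
        for j in range(len(values))
        if i != j  # omit the terms for which i = j
    )
    return squares, cross

squares, cross = square_of_sum_split(X)
product_of_x = prod(X)  # X_1 * X_2 * X_3

print(prop1, prop2, prop3, squares, cross, product_of_x)
```

The cross-term sum counts each pair twice (once as i, j and once as j, i), which is why it reproduces the terms 2X_1X_2 + 2X_2X_3 + 2X_1X_3.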

2.3 Random Variables and 
Probability Distributions 


A variable X is said to be a random variable (rv) if for every real number a
there exists a probability $P(X \le a)$ that X takes on a value less than or equal
to a. We shall denote random variables by capital letters X, Y, Z, and so on.
We shall use lowercase letters, x, y, z, and so on, to denote particular values
of the random variables. Thus P(X = x) is the probability that the random
variable X takes the value x. $P(x_1 \le X \le x_2)$ is the probability that the random
variable X takes values between $x_1$ and $x_2$, both inclusive.

If the random variable X can assume only a particular finite (or countably
infinite) set of values, it is said to be a discrete random variable. A random
variable is said to be continuous if it can assume any value in a certain range.
An example of a discrete random variable is the number of customers arriving
at a store during a certain period (say, the first hour of business). An example
of a continuous random variable is the income of a family in the United States.
In actual practice, use of continuous random variables is popular because the
mathematical theory is simpler. For instance, when we say that income is a
continuous random variable, we do not mean that it is literally continuous (in fact,
strictly speaking, it is discrete) but that it is a convenient approximation to treat
it that way.

A formula giving the probabilities for different values of the random variable
X is called a probability distribution in the case of discrete random variables,
and probability density function (denoted by p.d.f.) for continuous random
variables. This is usually denoted by f(x).

In general, for a continuous random variable, the occurrence of any exact
value of X may be regarded as having a zero probability. Hence probabilities
are discussed in terms of some ranges. These probabilities are obtained by integrating
f(x) over the desired range. For instance, if we want $\text{Prob}(a \le X \le b)$,
this is given by

$$ \text{Prob}(a \le X \le b) = \int_a^b f(x)\,dx $$

The probability that the random variable X takes on values at or below a number
c is often written as $F(c) = \text{Prob}(X \le c)$. The function F(x) represents, for
different values of x, the cumulated probabilities and hence is called the cumulative
distribution function (denoted by c.d.f.). Thus

$$ F(c) = \text{Prob}(X \le c) = \int_{-\infty}^{c} f(x)\,dx $$


Joint, Marginal, and Conditional Distributions 

We are often interested in not just one random variable but in the relationship
between several random variables. Suppose that we have two random variables,
X and Y. Now we have to consider:

1. The joint p.d.f.: f(x, y).

2. The marginal p.d.f.’s: f(x) and f(y).

3. The conditional p.d.f.’s:

(a) f(x|y), which is the distribution of X given Y = y.

(b) f(y|x), which is the distribution of Y given that X = x.

The joint density can be written as the product of the marginal and conditional
densities. Thus

$$ f(x, y) = f(x) f(y|x) = f(y) f(x|y) $$

If f(x, y) = f(x)f(y) for all x and y, then X and Y are said to be independent.
Note that if they are independent,

$$ f(x|y) = f(x) \quad \text{and} \quad f(y|x) = f(y) $$

that is, the conditional distributions are the same as the marginals. This makes
intuitive sense because, for X, the level at which Y is fixed is
irrelevant. Similarly, for Y it should be irrelevant at what level we fix X.

Illustrative Example 

Consider, for instance, the discrete distribution of X and Y defined by the following
probabilities:

                  x
   y        3      4      5      f(y)    f(y|x = 3)   f(y|x = 4)   f(y|x = 5)
   2       0.2     0     0.2     0.4        0.5           0           0.5
   4        0     0.2     0      0.2         0           1.0           0
   6       0.2     0     0.2     0.4        0.5           0           0.5
  f(x)     0.4    0.2    0.4

Since the conditional distributions of y depend on the value of x, X and Y
cannot be independent. On the other hand, if the distribution of X and Y is
defined as

                  x
   y        3      4      5      f(y)  = f(y|x = 3) = f(y|x = 4) = f(y|x = 5)
   2       0.2    0.2    0.1     0.5        0.5          0.5          0.5
   4       0.12   0.12   0.06    0.3        0.3          0.3          0.3
   6       0.08   0.08   0.04    0.2        0.2          0.2          0.2
  f(x)     0.4    0.4    0.2

we see that the conditional distributions of y for the different values of x and
the marginal distribution of y are the same and hence X and Y are independent.
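The marginal and conditional distributions in the two tables, and the independence check f(x, y) = f(x)f(y), can be computed mechanically. The sketch below stores each joint distribution as a dictionary; the helper names (`make_joint`, `marginal`, `conditional_y_given_x`, `is_independent`) are assumptions of this example, and exact fractions are used so that equality tests are not disturbed by rounding.

```python
from fractions import Fraction

X_VALUES = (3, 4, 5)
Y_VALUES = (2, 4, 6)

def make_joint(rows):
    """Build {(x, y): f(x, y)} from rows of decimal strings, one row per y value."""
    return {
        (x, y): Fraction(p)
        for y, row in zip(Y_VALUES, rows)
        for x, p in zip(X_VALUES, row)
    }

def marginal(joint, axis):
    """Marginal of x (axis=0) or y (axis=1): sum the joint over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], Fraction(0)) + p
    return out

def conditional_y_given_x(joint, x0):
    """f(y | x = x0) = f(x0, y) / f(x0)."""
    fx = marginal(joint, 0)
    return {y: p / fx[x0] for (x, y), p in joint.items() if x == x0}

def is_independent(joint):
    """True if f(x, y) = f(x) f(y) in every cell."""
    fx, fy = marginal(joint, 0), marginal(joint, 1)
    return all(p == fx[x] * fy[y] for (x, y), p in joint.items())

first = make_joint([("0.2", "0", "0.2"), ("0", "0.2", "0"), ("0.2", "0", "0.2")])
second = make_joint([("0.2", "0.2", "0.1"), ("0.12", "0.12", "0.06"), ("0.08", "0.08", "0.04")])

print(is_independent(first), is_independent(second))
print(conditional_y_given_x(first, 4))
```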


2.4 The Normal Probability Distribution
and Related Distributions


If we are given the probability distribution of a random variable X, we can
determine the probability that X lies in an interval (a, b). There are some probability
distributions for which the probabilities have been tabulated and which
are considered suitable descriptions for a wide variety of phenomena. These
are the normal distribution and the t, $\chi^2$, and F distributions. We discuss these
and also the lognormal and bivariate normal distributions. There are other distributions
as well, such as the gamma and beta distributions, for which extensive
tables are available. In fact, the $\chi^2$-distribution is a particular case of the
gamma distribution and the t and F distributions are particular cases of the
beta distribution. We do not need all these relationships here.

There is also a question of whether the normal distribution is an appropriate 
one to use to describe economic variables. However, even if the variables are 
not normally distributed, one can consider transformations of the variables so 
that the transformed variables are normally distributed. 


The Normal Distribution 


The normal distribution is a bell-shaped distribution which is used most extensively
in statistical applications in a wide variety of fields. Its probability density
function is given by

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right] \qquad -\infty < x < +\infty $$

Its mean is $\mu$ and its variance is $\sigma^2$. When x has the normal distribution with
mean $\mu$ and variance $\sigma^2$, we write this compactly as $x \sim N(\mu, \sigma^2)$.
An important property of the normal distribution is that any linear function
of normally distributed variables is also normally distributed. This is true
whether the variables are independent or correlated. If

$$ x_1 \sim N(\mu_1, \sigma_1^2) \quad \text{and} \quad x_2 \sim N(\mu_2, \sigma_2^2) $$

and the correlation between $x_1$ and $x_2$ is $\rho$, then

$$ a_1 x_1 + a_2 x_2 \sim N(a_1\mu_1 + a_2\mu_2,\; a_1^2\sigma_1^2 + a_2^2\sigma_2^2 + 2\rho a_1 a_2 \sigma_1 \sigma_2) $$

In particular,

$$ x_1 + x_2 \sim N(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2 + 2\rho\sigma_1\sigma_2) $$

and

$$ x_1 - x_2 \sim N(\mu_1 - \mu_2,\; \sigma_1^2 + \sigma_2^2 - 2\rho\sigma_1\sigma_2) $$
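The mean and variance of a linear combination of two correlated normal variables follow from the formula above by plugging in numbers. The function below is a sketch of that arithmetic only; its name and the illustrative parameter values are assumptions, and it takes standard deviations (not variances) as inputs.

```python
def linear_combination_moments(a1, a2, mu1, mu2, sigma1, sigma2, rho):
    """Mean and variance of a1*x1 + a2*x2 for normal x1, x2 with correlation rho."""
    mean = a1 * mu1 + a2 * mu2
    variance = (
        a1 ** 2 * sigma1 ** 2
        + a2 ** 2 * sigma2 ** 2
        + 2 * rho * a1 * a2 * sigma1 * sigma2
    )
    return mean, variance

# Illustrative values: means 1 and 2, standard deviations 2 and 3, correlation 0.5.
sum_moments = linear_combination_moments(1, 1, 1.0, 2.0, 2.0, 3.0, 0.5)
diff_moments = linear_combination_moments(1, -1, 1.0, 2.0, 2.0, 3.0, 0.5)

print(sum_moments)   # moments of x1 + x2
print(diff_moments)  # moments of x1 - x2
```

Setting a2 = -1 flips the sign of the covariance term, which is why a positive correlation lowers the variance of the difference and raises that of the sum.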

Related Distributions 

In addition to the normal distribution, there are other probability distributions
that we will be using frequently. These are the $\chi^2$, t, and F distributions tabulated
in the Appendix. These distributions are derived from the normal distribution
and are defined as described below.

$\chi^2$-Distribution

If $x_1, x_2, \ldots, x_n$ are independent normal variables with mean zero and variance
1, that is, $x_i \sim IN(0, 1)$, i = 1, 2, . . . , n, then

$$ Z = \sum_i x_i^2 $$

is said to have the $\chi^2$-distribution with degrees of freedom n, and we will write
this as $Z \sim \chi^2_n$. The subscript n denotes degrees of freedom. The $\chi^2_n$ distribution
is the distribution of the sum of squares of n independent standard normal variables.

If $x_i \sim IN(0, \sigma^2)$, then Z should be defined as

$$ Z = \sum_i \frac{x_i^2}{\sigma^2} $$

The $\chi^2$-distribution also has an “additive property,” although it is different from
the property of the normal distribution and is much more restrictive. The property
is:

If $Z_1 \sim \chi^2_n$ and $Z_2 \sim \chi^2_m$ and $Z_1$ and $Z_2$ are independent,
then $Z_1 + Z_2 \sim \chi^2_{n+m}$.

Note that we need independence and we can consider simple additions only,
not any general linear combinations. Even this limited property is useful in
practical applications. There are many distributions for which even this limited
property does not hold.
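The definition of the chi-square variable as a sum of squared independent standard normals can be illustrated by simulation. The sketch below relies on one fact not stated in the text, namely that a chi-square variable with n degrees of freedom has mean n; the function name, seed, and the choices n = 4 and m = 6 are assumptions made for illustration.

```python
import random
from statistics import fmean

def chi_square_draw(n, rng):
    """One draw of Z = sum of squares of n independent N(0, 1) variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(12345)  # fixed seed for a repeatable run
n, m, reps = 4, 6, 20_000

z1 = [chi_square_draw(n, rng) for _ in range(reps)]  # chi-square, n d.f.
z2 = [chi_square_draw(m, rng) for _ in range(reps)]  # independent chi-square, m d.f.
total = [a + b for a, b in zip(z1, z2)]              # should behave like n + m d.f.

# The simulated means should be close to the degrees of freedom 4, 6, and 10.
print(round(fmean(z1), 2), round(fmean(z2), 2), round(fmean(total), 2))
```

Matching means are only a necessary check of the additive property, not a proof; comparing whole histograms with a direct draw of n + m squared normals would be a stronger illustration.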
t-Distribution

If $x \sim N(0, 1)$ and $y \sim \chi^2_n$ and x and y are independent, $Z = x/\sqrt{y/n}$ has a t-distribution
with degrees of freedom n. We write this as $Z \sim t_n$. The subscript
n again denotes the degrees of freedom. Thus the t-distribution is the distribution
of a standard normal variable divided by the square root of an independent
averaged $\chi^2$ variable ($\chi^2$ variable divided by its degrees of freedom). The
t-distribution is a symmetric probability distribution like the normal distribution
but is flatter than the normal and has longer tails. As the degrees of freedom
n approaches infinity, the t-distribution approaches the normal distribution.

F-Distribution

If $y_1 \sim \chi^2_{n_1}$ and $y_2 \sim \chi^2_{n_2}$, and $y_1$ and $y_2$ are independent, $Z = (y_1/n_1)/(y_2/n_2)$ has
the F-distribution with degrees of freedom (d.f.) $n_1$ and $n_2$. We write this as

$$ Z \sim F_{n_1, n_2} $$

The first subscript, $n_1$, refers to the d.f. of the numerator, and the second subscript,
$n_2$, refers to the d.f. of the denominator. The F-distribution is thus the
distribution of the ratio of two independent averaged $\chi^2$ variables.

2.5 Classical Statistical Inference 


Statistical inference is the area that describes the procedures by which we use
the observed data to draw conclusions about the population from which the
data came or about the process by which the data were generated. Our assumption
is that there is an unknown process that generates the data we have
and that this process can be described by a probability distribution, which, in
turn, can be characterized by some unknown parameters. For instance, for a
normal distribution the unknown parameters are $\mu$ and $\sigma^2$.

Broadly speaking, statistical inference can be classified under two headings: 
classical inference and Bayesian inference. Classical statistical inference is 
based on two premises: 

1. The sample data constitute the only relevant information. 

2. The construction and assessment of the different procedures for inference 
are based on long-run behavior under essentially similar circumstances. 

In Bayesian inference we combine sample information with prior information.
Suppose that we draw a random sample $y_1, y_2, \ldots, y_n$ of size n from a
normal population with mean $\mu$ and variance $\sigma^2$ (assumed known), and we want
to make inferences about $\mu$.

In classical inference we take the sample mean $\bar{y}$ as our estimate of $\mu$. Its
variance is $\sigma^2/n$. The inverse of this variance is known as the sample precision.
Thus the sample precision is $n/\sigma^2$.

In Bayesian inference we have prior information on $\mu$. This is expressed in
terms of a probability distribution known as the prior distribution. Suppose that
the prior distribution is normal with mean $\mu_0$ and variance $\sigma_0^2$, that is, precision
$1/\sigma_0^2$. We now combine this with the sample information to obtain what is known
as the posterior distribution of $\mu$. This distribution can be shown to be normal.
Its mean is a weighted average of the sample mean $\bar{y}$ and the prior mean $\mu_0$,
weighted by the sample precision and prior precision, respectively. Thus

$$ \mu\ (\text{Bayesian}) = \frac{w_1 \bar{y} + w_2 \mu_0}{w_1 + w_2} $$

where $w_1 = n/\sigma^2$ = sample precision

      $w_2 = 1/\sigma_0^2$ = prior precision

Also, the precision (or inverse of the variance) of the posterior distribution of
$\mu$ is $w_1 + w_2$, that is, the sum of sample precision and prior precision.
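The precision-weighting rule can be sketched in a few lines. The function below is an illustration rather than a general Bayesian routine: its name and argument names are assumptions, it covers only the normal case with known variance described above, and it is applied here to the numbers of the example that follows (sample mean 20 with variance 4, prior mean 10 with variance 2).

```python
def normal_posterior(sample_mean, sample_mean_variance, prior_mean, prior_variance):
    """Posterior mean and variance when sample mean and prior are both normal."""
    w1 = 1.0 / sample_mean_variance  # sample precision, n / sigma^2
    w2 = 1.0 / prior_variance        # prior precision, 1 / sigma_0^2
    posterior_mean = (w1 * sample_mean + w2 * prior_mean) / (w1 + w2)
    posterior_variance = 1.0 / (w1 + w2)  # precisions add
    return posterior_mean, posterior_variance

post_mean, post_var = normal_posterior(20.0, 4.0, 10.0, 2.0)
print(round(post_mean, 2), round(post_var, 2))
```

Because the prior variance is smaller than the sample variance in this example, the prior mean receives the larger weight and the posterior mean lies closer to 10 than to 20.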

For instance, if the sample mean is 20 with variance 4 and prior mean is 10
with variance 2, we have

$$ \text{posterior mean} = \frac{\frac{1}{4}(20) + \frac{1}{2}(10)}{\frac{1}{4} + \frac{1}{2}} = \frac{10}{3/4} = 13.33 $$

$$ \text{posterior variance} = \left(\frac{1}{4} + \frac{1}{2}\right)^{-1} = \frac{4}{3} = 1.33 $$

The posterior mean will lie between the sample mean and the prior mean. The
posterior variance will be less than both the sample and prior variances.

We do not discuss Bayesian inference in this book, because this would take
us into a lot more detail than we intend to cover.¹ However, the basic notion
of combining the sample mean and prior mean in inverse proportion to their
variances will be a useful one to remember.

¹For more discussion of Bayesian econometrics, refer to A. Zellner, Introduction to Bayesian
Analysis in Econometrics (New York: Wiley, 1971), and E. E. Leamer, Specification Searches:
Ad Hoc Inference with Nonexperimental Data (New York: Wiley, 1978).

Returning to classical inference, it is customary to discuss classical statistical
inference under three headings:

1. Point estimation.

2. Interval estimation.

3. Testing of hypotheses.

Point Estimation

Suppose that the probability distribution involves a parameter $\theta$ and we have a
sample of size n, namely $y_1, y_2, \ldots, y_n$, from this probability distribution. In
point estimation we construct a function $g(y_1, y_2, \ldots, y_n)$ of these observations
and say that g is our estimate (guess) of $\theta$. The common terminology is
to call the function $g(y_1, y_2, \ldots, y_n)$ an estimator and its value in a particular
sample an estimate. Thus an estimator is a random variable and an estimate is
a particular value of this random variable. For instance, if $\theta$ is the population
mean and $g(y_1, y_2, \ldots, y_n) = (1/n) \sum y_i = \bar{y}$ is the sample mean, $\bar{y}$ is an
estimator of $\theta$. If $\bar{y} = 4$ in a particular sample, 4 is an estimate of $\theta$.

In interval estimation we construct two functions, $g_1(y_1, y_2, \ldots, y_n)$ and
$g_2(y_1, y_2, \ldots, y_n)$, of the sample observations and say that $\theta$ lies between $g_1$
and $g_2$ with a certain probability. In hypothesis testing we suggest a hypothesis
about $\theta$ (say, $\theta = 4.0$) and examine the degree of evidence in the sample in
favor of the hypothesis, on the basis of which we either accept or reject the
hypothesis.

In practice what we need to know is how to construct the point estimator g,
the interval estimator $(g_1, g_2)$, and the procedures for testing hypotheses. In
classical statistical inference all these are based on sampling distributions.
Sampling distributions are probability distributions of functions of sample observations.
For instance, the sample mean $\bar{y}$ is a function of the sample observations
and its probability distribution is called the sampling distribution of $\bar{y}$.
In classical statistical inference the properties of estimators are discussed in
terms of the properties of their sampling distributions.

2.6 Properties of Estimators 


There are some desirable properties of estimators that are often mentioned in 
the book. These are: 

1. Unbiasedness. 

2. Efficiency. 

3. Consistency. 

The first two are small-sample properties. The third is a large-sample property.

Unbiasedness 

An estimator g is said to be unbiased for $\theta$ if $E(g) = \theta$, that is, the mean of the
sampling distribution of g is equal to $\theta$. What this says is that if we calculate g
for each sample and repeat this process infinitely many times, the average of
all these estimates will be equal to $\theta$. If $E(g) \ne \theta$, then g is said to be biased
and we refer to $E(g) - \theta$ as the bias.

Unbiasedness is a desirable property but not at all costs. Suppose that we
have two estimators $g_1$ and $g_2$, and $g_1$ can assume values far away from $\theta$ and
yet have its mean equal to $\theta$, whereas $g_2$ always ranges close to $\theta$ but has its
mean slightly away from $\theta$. Then we might prefer $g_2$ to $g_1$ because it has smaller
variance even though it is biased. If the variance of the estimator is large, we
can have some unlucky samples where our estimate is far from the true value.
Thus the second property we want our estimators to have is a small variance.
One criterion that is often suggested is the mean-squared error (MSE), which
is defined by

$$ \text{MSE} = (\text{bias})^2 + \text{variance} $$

The MSE criterion gives equal weights to these components. Instead, we can
consider a weighted average $W(\text{bias})^2 + (1 - W)\,\text{variance}$. Strictly speaking,
instead of doing something ad hoc like this, we should specify a loss function
that gives the loss in using $g(y_1, \ldots, y_n)$ as an estimator of $\theta$ and choose g to
minimize expected loss.
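The trade-off between bias and variance can be made concrete with a small simulation. In the sketch below both estimators are hypothetical choices made for illustration: `g1` uses only the first observation (unbiased but noisy) and `g2` is 0.9 times the sample mean (slightly biased, much less variable). MSE is computed as the average squared distance from the true value, which is the usual meaning of the term and equals (bias)^2 + variance.

```python
import random
from statistics import fmean, pvariance

def mse_parts(estimates, theta):
    """Return (bias, variance, MSE) for a list of simulated estimates of theta."""
    bias = fmean(estimates) - theta
    variance = pvariance(estimates)
    mse = fmean([(g - theta) ** 2 for g in estimates])
    return bias, variance, mse

rng = random.Random(1)
theta, n, reps = 2.0, 5, 5000
g1, g2 = [], []
for _ in range(reps):
    sample = [rng.gauss(theta, 1.0) for _ in range(n)]
    g1.append(sample[0])            # unbiased, but variance is about 1
    g2.append(0.9 * fmean(sample))  # biased by about -0.2, variance about 0.16

parts1 = mse_parts(g1, theta)
parts2 = mse_parts(g2, theta)
print([round(v, 3) for v in parts1])
print([round(v, 3) for v in parts2])
```

In this setup the biased estimator has the smaller MSE, which is the situation the text describes when it says that unbiasedness is desirable "but not at all costs."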


Efficiency 

The property of efficiency is concerned with the variance of estimators. Obviously,
it is a relative concept and we have to confine ourselves to a particular
class. If g is an unbiased estimator and it has the minimum variance in the class
of unbiased estimators, g is said to be an efficient estimator. We say that g is
an MVUE (a minimum-variance unbiased estimator).

If we confine ourselves to linear estimators, that is, $g = c_1 y_1 + c_2 y_2 + \cdots + c_n y_n$,
where the c’s are constants which we choose so that g is unbiased
and has minimum variance, g is called a BLUE (a best linear unbiased estimator).


Consistency 

Often it is not possible to find estimators that have desirable small-sample properties
such as unbiasedness and efficiency. In such cases it is customary to look
at desirable properties in large samples. These are called asymptotic properties.
Three such properties often mentioned are consistency, asymptotic unbiasedness,
and asymptotic efficiency.

Suppose that $\hat{\theta}_n$ is the estimator of $\theta$ based on a sample of size n. Then the
sequence of estimators $\hat{\theta}_n$ is called a consistent sequence if for any arbitrarily
small positive numbers $\epsilon$ and $\delta$ there is a sample size $n_0$ such that

$$ \text{Prob}\left[|\hat{\theta}_n - \theta| < \epsilon\right] > 1 - \delta \quad \text{for all } n > n_0 $$

That is, by increasing the sample size n the estimator $\hat{\theta}_n$ can be made to lie
arbitrarily close to the true value of $\theta$ with probability arbitrarily close to 1.
This statement is also written as

$$ \lim_{n \to \infty} P(|\hat{\theta}_n - \theta| < \epsilon) = 1 $$

and more briefly we write it as

$$ \hat{\theta}_n \xrightarrow{p} \theta \quad \text{or} \quad \text{plim}\ \hat{\theta}_n = \theta $$

(plim is “probability limit”). $\hat{\theta}_n$ is said to converge in probability to $\theta$. In practice
we drop the subscript n on $\hat{\theta}_n$ and also drop the words “sequence of estimators”
and merely say that $\hat{\theta}$ is a consistent estimator for $\theta$.
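The definition of consistency can be visualized by estimating the probability that the estimator falls within a fixed distance of the true value for increasing sample sizes. The sketch below uses the sample mean of N(theta, 1) data as the estimator; the function name, the seed, and the choices theta = 3 and epsilon = 0.1 are assumptions made for illustration.

```python
import random
from statistics import fmean

def share_within(theta, n, eps, reps, rng):
    """Simulated Prob(|sample mean - theta| < eps) for N(theta, 1) samples of size n."""
    hits = 0
    for _ in range(reps):
        xbar = fmean(rng.gauss(theta, 1.0) for _ in range(n))
        hits += abs(xbar - theta) < eps
    return hits / reps

rng = random.Random(2024)
shares = {
    n: share_within(theta=3.0, n=n, eps=0.1, reps=500, rng=rng)
    for n in (10, 100, 1000)
}
for n, share in shares.items():
    # The share of estimates within 0.1 of theta rises toward 1 as n grows.
    print(n, round(share, 3))
```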

A sufficient condition for $\hat{\theta}$ to be consistent is that the bias and variance
should both tend to zero as the sample size increases. This condition is often
useful to check in practice, but it should be noted that the condition is not
necessary. An estimator can be consistent even if the bias does not tend to
zero.

There are also some relations in probability limits that are useful in proving 
consistency. These are 

1. $\text{plim}\,(c_1 y_1 + c_2 y_2) = c_1\,\text{plim}\,y_1 + c_2\,\text{plim}\,y_2$, where $c_1$ and $c_2$ are constants.

2. $\text{plim}\,(y_1 y_2) = (\text{plim}\,y_1)(\text{plim}\,y_2)$.

3. $\text{plim}\,(y_1/y_2) = (\text{plim}\,y_1)/(\text{plim}\,y_2)$ provided that $\text{plim}\,y_2 \ne 0$.

4. If $\text{plim}\,y = c$ and g(y) is a continuous function of y, then $\text{plim}\,g(y) = g(c)$.

Other Asymptotic Properties 

In addition to consistency, there are two other concepts, asymptotic unbiasedness
and asymptotic efficiency, that are often used in discussions in econometrics.
To explain these concepts we first have to define the concept of limiting
distribution, which is as follows: If we consider a sequence of random variables
$y_1, y_2, y_3, \ldots$ with corresponding distribution functions $F_1, F_2, F_3, \ldots$, this
sequence is said to converge in distribution to a random variable y with distribution
function F if $F_n(x)$ converges to F(x) as $n \to \infty$ for all continuity points
of F.

It is important to note that the moments of F are not necessarily the moments
of $F_n$. In fact, in econometrics, it is often the case that $F_n$ does not have moments
but F does.

Example 

Suppose that x is a random variable with mean $\mu \ne 0$ and variance $\sigma^2$ and
P(x = 0) = c > 0. Suppose that we have a random sample of size n: $x_1, x_2, \ldots, x_n$.
Let $\bar{x}_n$ be the sample mean. The subscript n denotes the sample size. Define
the random variable $y_n = 1/\bar{x}_n$. Then $E(y_n)$ does not exist because there is a
positive probability that $x_1 = x_2 = \cdots = x_n = 0$. Thus $y_n = \infty$ if $\bar{x}_n = 0$. However,
in this case it can be shown that $\sqrt{n}\,(1/\bar{x}_n - 1/\mu)$ has a limiting distribution that
is normal with mean 0 and variance $\sigma^2/\mu^4$. Thus even if the distribution of $y_n$
does not have a mean and variance, its limiting distribution has a mean $1/\mu$ and
variance $\sigma^2/(n\mu^4)$.

When we consider asymptotic distributions, we consider $\sqrt{n}$ times the estimator
because otherwise the variance is of order 1/n, which $\to 0$ as $n \to \infty$.
Thus when we compare two estimators $T_1$ and $T_2$, the variances of the asymptotic
distributions both $\to 0$ as $n \to \infty$. Hence in discussions of efficiency, we
compare the variances of the distributions of $\sqrt{n}\,T_1$ and $\sqrt{n}\,T_2$. We shall now
define asymptotic unbiasedness and asymptotic efficiency formally.

Asymptotic Unbiasedness 

The estimator $\hat{\theta}_n$, based on a sample of size n, is an asymptotically unbiased
estimator of $\theta$ if the mean of the limiting distribution of $\sqrt{n}\,(\hat{\theta}_n - \theta)$ is zero.
We denote this by writing

$$ AE(\hat{\theta}_n) = \theta \qquad (AE = \text{asymptotic expectations}) $$

Sometimes, an alternative definition is used: $\hat{\theta}_n$ is an asymptotically unbiased
estimator of $\theta$ if $\lim_{n \to \infty} E(\hat{\theta}_n) = \theta$. The problem with this definition is that
$E(\hat{\theta}_n)$ may not exist, as in the case of the example we cited earlier.


Asymptotic Variance 

Again we have two definitions: one the variance of the limiting distribution, the
other the limit of variances. The problem with the latter definition is that
$\text{var}(\hat{\theta}_n)$ may not exist. The two definitions are:

1. $AE[\sqrt{n}\,(\hat{\theta}_n - \theta)]^2$ if we consider the variance of the limiting distribution.

2. $\lim_{n \to \infty} E[\sqrt{n}\,(\hat{\theta}_n - \theta)]^2$ if we consider the limit of variances.


Some Examples 

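Consistency and asymptotic unbiasedness are different, and the two concepts
of unbiasedness are also different; that is, $\text{plim}(\hat{\theta}_n)$, $AE(\hat{\theta}_n)$, and $\lim E(\hat{\theta}_n)$ are
not all the same. Suppose that we have a sample of size n from a normal distribution
$N(\theta, 1)$. Consider $\hat{\theta}_n = \bar{x}$ as an estimator of $\theta$. Then

$$ \text{plim}(\hat{\theta}_n) = AE(\hat{\theta}_n) = \lim E(\hat{\theta}_n) = \theta $$

But suppose that we define

$$ \hat{\theta}_n = \frac{1}{2} x_1 + \frac{1}{2n} \sum_{i=1}^{n} x_i $$

Then

$$ E(\hat{\theta}_n) = \frac{1}{2}\theta + \frac{1}{2}\theta = \theta \quad \text{and} \quad \lim_{n \to \infty} E(\hat{\theta}_n) = \theta $$

But $\text{plim}(\hat{\theta}_n) = \frac{1}{2} x_1 + \frac{1}{2}\theta \ne \theta$. Thus $\text{plim}(\hat{\theta}_n) \ne \lim E(\hat{\theta}_n)$. Thus we have an
asymptotically unbiased estimator of $\theta$ that is not consistent. As another example,
consider $1/\bar{x}_n$ as an estimator of $1/\theta$. We have $\text{plim}(1/\bar{x}_n) = 1/\theta$, so that
the estimator is consistent. Also, $AE(1/\bar{x}_n) = 1/\theta$. But $\lim E(1/\bar{x}_n)$ does not
exist. Thus the estimator is asymptotically unbiased or not depending on what
definition we use.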
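The difference between "unbiased in the limit" and "consistent" can be seen in a simulation. The sketch below assumes that the estimator in the first example has the form (1/2)x_1 + (1/2) times the sample mean, which agrees with the probability limit stated in the text; the function names, the seed, and the values theta = 2 and epsilon = 0.1 are assumptions made for illustration.

```python
import random
from statistics import fmean

def theta_hat(sample):
    """Assumed estimator: half the first observation plus half the sample mean."""
    return 0.5 * sample[0] + 0.5 * fmean(sample)

def simulate(theta, n, reps, eps, rng):
    """Return (average estimate, share of estimates within eps of theta)."""
    estimates = [
        theta_hat([rng.gauss(theta, 1.0) for _ in range(n)]) for _ in range(reps)
    ]
    average = fmean(estimates)
    share_close = sum(abs(e - theta) < eps for e in estimates) / reps
    return average, share_close

rng = random.Random(99)
theta = 2.0
results = {n: simulate(theta, n, reps=1000, eps=0.1, rng=rng) for n in (10, 1000)}
for n, (average, share_close) in results.items():
    print(n, round(average, 3), round(share_close, 3))
```

The average of the estimates stays near theta for both sample sizes, but the share of estimates within 0.1 of theta does not approach 1 as n grows, because the first observation keeps half the weight no matter how large the sample is.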

Very often in the econometric literature, the asymptotic variances are obtained
by replacing AE by plim and evaluating plims. (The implicit assumption
is that they are equal.) Thus the asymptotic variance of $\hat{\theta}_n$ is evaluated as

$$ AE[\sqrt{n}\,(\hat{\theta}_n - \theta)]^2 = \text{plim}\,[\sqrt{n}\,(\hat{\theta}_n - \theta)]^2 $$


2.7 Sampling Distributions for Samples 
from a Normal Population 


The most commonly used sampling distributions are those for samples from 
normal populations. We state the results here (without proof). Suppose that we 
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have a sample of n independent observations, namely, y„ y 2 , . . . , y„, from a 
normal population with mean p and variance ct 2 . Consider the following: 


y 

s 2 


— t 2 (y, - ft 

n — 1 


sample mean 
sample variance 


( 2 . 1 ) 


Then 

1. The sampling distribution of the sample mean $\bar{y}$ is also normal with mean $\mu$ and variance $\sigma^2/n$.

2. $(n - 1)S^2/\sigma^2$ has a $\chi^2$-distribution with degrees of freedom $(n - 1)$. Further, the distributions of $\bar{y}$ and $S^2$ are independent.

3. Since

$$\frac{\sqrt{n}(\bar{y} - \mu)}{\sigma} \sim N(0, 1) \qquad \text{and} \qquad \frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$

and these distributions are independent, we have (by the definition of the t-distribution as the distribution of a standard normal variable divided by the square root of an independent averaged $\chi^2$ variable) that $\sqrt{n}(\bar{y} - \mu)/S$ has a t-distribution with degrees of freedom $(n - 1)$.

4. Also, $E(\bar{y}) = \mu$ and $E(S^2) = \sigma^2$ and thus $\bar{y}$ and $S^2$ are unbiased estimators for $\mu$ and $\sigma^2$, respectively.

These sampling distributions will also be used to get interval estimates and for testing hypotheses about $\mu$ and $\sigma^2$.
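The results listed above can be checked numerically. The following Python sketch is an illustration only and not part of the formal development: it draws repeated samples from a normal population and compares the simulated averages with the theoretical values. The values n = 20, mean 5, standard deviation 3, the number of replications, and the seed are all arbitrary assumptions chosen for the illustration.

```python
import random

def simulate(n=20, mu=5.0, sigma=3.0, reps=5000, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma^2).

    Returns the list of sample means and the list of sample variances
    (the latter computed with divisor n - 1, as in equation (2.1)).
    """
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = sum(sample) / n
        s2 = sum((y - ybar) ** 2 for y in sample) / (n - 1)
        means.append(ybar)
        variances.append(s2)
    return means, variances

means, variances = simulate()
reps = len(means)

mean_of_means = sum(means) / reps                                  # theory: mu = 5
var_of_means = sum((m - mean_of_means) ** 2 for m in means) / reps  # theory: sigma^2/n = 0.45
mean_of_s2 = sum(variances) / reps                                 # theory: sigma^2 = 9
# (n - 1) S^2 / sigma^2 is chi-square with n - 1 = 19 d.f., whose mean is 19
mean_scaled_s2 = sum(19 * v / 9.0 for v in variances) / reps

print(f"average of sample means     : {mean_of_means:.3f}")
print(f"variance of sample means    : {var_of_means:.3f}")
print(f"average of sample variances : {mean_of_s2:.3f}")
print(f"average of (n-1)S^2/sigma^2 : {mean_scaled_s2:.3f}")
```

With a few thousand replications the simulated figures should lie close to, but not exactly at, the theoretical values; the gap shrinks as the number of replications grows.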


2.8 Interval Estimation 


In interval estimation we construct two functions $g_1(y_1, y_2, \ldots, y_n)$ and $g_2(y_1, y_2, \ldots, y_n)$ of the sample observations such that

$$\text{Prob}(g_1 < \theta < g_2) = \alpha, \quad \text{a given probability} \tag{2.2}$$

$\alpha$ is called the confidence coefficient and the interval $(g_1, g_2)$ is called the confidence interval. Since $\theta$ is a parameter (or a constant that is unknown), the probability statement (2.2) is a statement about $g_1$ and $g_2$ and not about $\theta$. What it implies is that if we use the formulas $g_1(y_1, y_2, \ldots, y_n)$ and $g_2(y_1, y_2, \ldots, y_n)$ repeatedly with different samples and in each case construct the confidence intervals using these formulas, then in $100\alpha$ percent of all the cases (samples) the interval given will include the true value.
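The repeated-sampling interpretation can be illustrated by simulation. The sketch below is a hedged illustration under assumed values (n = 20, true mean 5, true standard deviation 3, 4000 replications, a fixed seed): for each simulated sample it builds the interval $\bar{y} \pm 2.093\,S/\sqrt{n}$, where 2.093 is the tabulated t value for 19 degrees of freedom that is used later in this section, and records how often the interval contains the true mean.

```python
import math
import random

T_CRIT_19 = 2.093  # tabulated value: Prob(-2.093 < t < 2.093) = 0.95 for 19 d.f.

def coverage(n=20, mu=5.0, sigma=3.0, reps=4000, seed=1):
    """Share of simulated 95% intervals for the mean that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        ybar = sum(sample) / n
        s = math.sqrt(sum((y - ybar) ** 2 for y in sample) / (n - 1))
        half_width = T_CRIT_19 * s / math.sqrt(n)
        if ybar - half_width < mu < ybar + half_width:
            hits += 1
    return hits / reps

rate = coverage()
print(f"share of intervals covering the true mean: {rate:.3f}")
```

The share should be close to 0.95. Note that in any single sample the interval either contains the true mean or it does not; the 95% refers to the procedure, not to one particular interval.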

As an illustration of how to use the sampling distributions to construct confidence intervals, consider a sample $y_1, y_2, \ldots, y_n$ of n independent observations from a normal population with mean $\mu$ and variance $\sigma^2$. Then

$$\frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n-1} \qquad \text{and} \qquad \frac{\sqrt{n}(\bar{y} - \mu)}{S} \sim t_{n-1}$$

where $\bar{y}$ and $S^2$ are the sample mean and sample variance defined in (2.1).

If the sample size n is 20, so that the degrees of freedom = n - 1 = 19, we can refer to the $\chi^2$ tables with 19 degrees of freedom and say that

$$\text{Prob}\left(\frac{19S^2}{\sigma^2} > 32.852\right) = 0.025 \tag{2.3}$$

or

$$\text{Prob}\left(10.117 < \frac{19S^2}{\sigma^2} < 30.144\right) = 0.90 \tag{2.4}$$


Also, referring to the t-tables with 19 degrees of freedom, we find that

$$\text{Prob}\left(-2.093 < \frac{\sqrt{n}(\bar{y} - \mu)}{S} < 2.093\right) = 0.95 \tag{2.5}$$

From equation (2.4) we get

$$\text{Prob}\left(\frac{19S^2}{30.144} < \sigma^2 < \frac{19S^2}{10.117}\right) = 0.90$$

and if $S^2 = 9.0$ we get the 90% confidence interval for $\sigma^2$ as (5.7, 16.9). From equation (2.5) we get

$$\text{Prob}\left(\bar{y} - \frac{2.093S}{\sqrt{n}} < \mu < \bar{y} + \frac{2.093S}{\sqrt{n}}\right) = 0.95$$

If $\bar{y} = 5$ and $S = 3.0$ we get the 95% confidence interval for $\mu$ as (3.6, 6.4).

These intervals are called two-sided intervals. One can also construct one-sided intervals. For instance, equation (2.3) implies that

$$\text{Prob}\left(\sigma^2 < \frac{19S^2}{32.852}\right) = 0.025$$

or

$$\text{Prob}\left(\sigma^2 > \frac{19S^2}{32.852}\right) = 0.975$$

If $S^2 = 9$, we get the 97.5% (right-sided) confidence interval for $\sigma^2$ as $(5.205, \infty)$. We can construct similar one-sided confidence intervals for $\mu$.
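The arithmetic behind the intervals in this section can be reproduced in a few lines. The Python sketch below is only a check of the numbers quoted in the text: the critical values are copied from the tables as given above (they are not computed), and the variable names are illustrative choices.

```python
import math

# Tabulated critical values for 19 degrees of freedom, as quoted in the text
CHI2_LOW = 10.117      # lower 5% point of chi-square(19)
CHI2_HIGH = 30.144     # upper 5% point of chi-square(19)
CHI2_UPPER_025 = 32.852  # upper 2.5% point of chi-square(19)
T_UPPER_025 = 2.093    # upper 2.5% point of t(19)

n, ybar, s2 = 20, 5.0, 9.0
df = n - 1
s = math.sqrt(s2)

# 90% two-sided interval for the variance, from equation (2.4)
var_interval = (df * s2 / CHI2_HIGH, df * s2 / CHI2_LOW)

# 95% two-sided interval for the mean, from equation (2.5)
half_width = T_UPPER_025 * s / math.sqrt(n)
mean_interval = (ybar - half_width, ybar + half_width)

# 97.5% one-sided (right-sided) interval for the variance, from equation (2.3)
var_lower_bound = df * s2 / CHI2_UPPER_025

print("90% interval for variance:", tuple(round(v, 1) for v in var_interval))
print("95% interval for mean    :", tuple(round(v, 1) for v in mean_interval))
print("97.5% lower bound for variance:", round(var_lower_bound, 3))
```

Rounded to one decimal place, the two-sided intervals agree with (5.7, 16.9) and (3.6, 6.4), and the one-sided bound agrees with 5.205.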


2.9 Testing of Hypotheses 


We list here some essential points regarding hypothesis testing. 

1. A statistical hypothesis is a statement about the values of some parameters in the hypothetical population from which the sample is drawn.

2. A hypothesis which says that a parameter has a specified value is called a point hypothesis. A hypothesis which says that a parameter lies in a specified interval is called an interval hypothesis. For instance, if $\mu$ is the population mean, then $H$: $\mu = 4$ is a point hypothesis. $H$: $4 \le \mu < 7$ is an interval hypothesis.

3. A hypothesis test is a procedure that answers the question of whether the observed difference between the sample value and the population value hypothesized is real or due to chance variation. For instance, if the hypothesis says that the population mean $\mu = 6$ and the sample mean $\bar{y} = 8$, then we want to know whether this difference is real or due to chance variation.

4. The hypothesis we are testing is called the null hypothesis and is often denoted by $H_0$.² The alternative hypothesis is denoted by $H_1$. The probability of rejecting $H_0$ when, in fact, it is true, is called the significance level. To test whether the observed difference between the data and what is expected under the null hypothesis $H_0$ is real or due to chance variation, we use a test statistic. A desirable criterion for the test statistic is that its sampling distribution be tractable, preferably with tabulated probabilities. Tables are already available for the normal, t, $\chi^2$, and F distributions, and hence the test statistics chosen are often those that have these distributions.

5. The observed significance level or P-value is the probability of getting a value of the test statistic that is as extreme or more extreme than the observed value of the test statistic. This probability is computed on the basis that the null hypothesis is correct. For instance, consider a sample of n independent observations from a normal population with mean $\mu$ and variance $\sigma^2$. We want to test

$H_0$: $\mu = 7$ against $H_1$: $\mu \ne 7$

The test statistic we use is

$$t = \frac{\sqrt{n}(\bar{y} - \mu)}{S}$$

which has a t-distribution with $(n - 1)$ degrees of freedom. Suppose that $n = 25$, $\bar{y} = 10$, $S = 5$. Then under the assumption that $H_0$ is true, the observed value of t is $t_0 = 3$. Since high positive values of t are evidence against the null hypothesis $H_0$, the P-value is [since degrees of freedom $(n - 1) = 24$]

$$P = \text{Prob}(t_{24} > 3)$$

This is the observed significance level.

6. It is common practice to say simply that the result of the test is (statistically) significant or not significant and not report the actual P-values. The meaning of the two terms is as follows:

Statistically significant. Sampling variation is an unlikely explanation of the discrepancy between the null hypothesis and sample values.

Statistically insignificant. Sampling variation is a likely explanation of the discrepancy between the null hypothesis and the sample value.

Also, the terms significant and highly significant are customarily used to denote “significant at the 0.05 level” and “significant at the 0.01 level” respectively. However, consider two cases where the P-values are 0.055 and 0.045, respectively. Then in the former case one would say that the results are “not significant” and in the latter case one would say that the results are “significant,” although the sample evidence is marginally different in both cases. Similarly, two tests with P-values of 0.80 and 0.055 will both be considered “not significant,” although there is a tremendous difference in the compatibility of the sample evidence with the null hypothesis in the two cases. That is why many computer programs print out P-values.³

² This terminology is unfortunate because “null” means “zero, void, insignificant, amounts to nothing, etc.” A hypothesis $\mu = 0$ can be called a null hypothesis, but a hypothesis $\mu = 100$ should not be called a “null” hypothesis. However, this is the standard terminology that was introduced in the 1930s by the statisticians Jerzy Neyman and E. S. Pearson.

7. Statistical significance and practical significance are not the same thing. A result that is highly significant statistically may be of no practical significance at all. For instance, suppose that we consider a shipment of cans of cashews with expected mean weight of 450 g. If the actual sample mean of weights is 449.5 g, the difference may be practically insignificant but could be highly statistically significant if we have a large enough sample or a small enough sampling variance (note that the test statistic has $\sqrt{n}$ in the numerator and $S$ in the denominator). On the other hand, suppose that in the case of precision instruments a part is expected to be of length 10 cm and a sample has a mean length of 9.9 cm. If n is small and S is large, the difference might not be statistically significant but could be practically very significant. The shipment could simply be useless.

8. It is customary to reject the null hypothesis $H_0$ when the test statistic is statistically significant at a chosen significance level and not to reject $H_0$ when the test statistic is not statistically significant at the chosen significance level. There is, however, some controversy on this issue which we discuss later in item 9. In reality, $H_0$ may be either true or false. Corresponding to the two cases of reality and the two conclusions drawn, we have the following four possibilities:

                                                      Reality
Result of the Test                     H_0 Is True                H_0 Is False
Significant (reject H_0)               Type I error or α error    Correct conclusion
Not significant (do not reject H_0)    Correct conclusion         Type II error or β error


³ A question of historical interest is: “How did the numbers 0.05 and 0.01 creep into all these textbooks?” The answer is that they were suggested by the famous statistician Sir R. A. Fisher (1890-1962), the “father” of modern statistics, and his prescription has been followed ever since.

There are two possible errors that we can make:

1. Rejecting $H_0$ when it is true. This is called the type I error or $\alpha$ error.

2. Not rejecting $H_0$ when it is not true. This is called the type II error or $\beta$ error.

Thus

$$\alpha = \text{Prob}(\text{rejecting } H_0 \mid H_0 \text{ is true})$$

$$\beta = \text{Prob}(\text{not rejecting } H_0 \mid H_0 \text{ is not true})$$

$\alpha$ is just the significance level, defined earlier. $(1 - \beta)$ is called the power of the test. The power of the test cannot be computed unless the alternative hypothesis $H_1$ is specified; that is, $H_0$ is not true means that $H_1$ is true.

For example, consider the problem of testing the hypothesis

$H_0$: $\mu = 10$ against $H_1$: $\mu = 15$

for a normal population with mean $\mu$ and variance $\sigma^2$. The test statistic we use is $t = \sqrt{n}(\bar{x} - \mu)/S$. From the sample data we get the values of $n$, $\bar{x}$, and $S$. To calculate $\alpha$ we use $\mu = 10$, and to calculate $\beta$ we use $\mu = 15$. The two errors are

$$\alpha = \text{Prob}(t > t^* \mid \mu = 10)$$

$$\beta = \text{Prob}(t < t^* \mid \mu = 15)$$

where $t^*$ is the cutoff point of t that we use to reject or not reject $H_0$. The distributions of t under $H_0$ and $H_1$ are shown in Figure 2.1. In our example $\alpha$ is the right-hand tail area from the distribution of t under $H_0$ and $\beta$ is the left-hand tail area from the distribution of t under $H_1$. If the alternative hypothesis $H_1$ says that $\mu < 0$, then the distribution of t under $H_1$ would be to the left of the distribution of t under $H_0$. In this case $\alpha$ would be the left-hand tail area of the distribution of t under $H_0$ and $\beta$ would be the right-hand tail area of the distribution of t under $H_1$.

[Figure 2.1. Type I and type II errors in testing a hypothesis. The figure shows the densities $f(t \mid H_0)$ and $f(t \mid H_1)$.]

The usual procedure that is suggested (which is called the Neyman-Pearson approach) is to fix $\alpha$ at a certain level and minimize $\beta$, that is, choose the test statistic that has the most power. In practice, the tests we use, such as the t, $\chi^2$, and F tests, have been shown to be the most powerful tests.

9. There are some statisticians who disagree with the ideas of the Neyman-Pearson theory. For instance, Kalbfleisch and Sprott⁴ argue that it is a gross simplification to regard a test of significance as a decision rule for accepting or rejecting a hypothesis. They argue that such decisions are made on more than just experimental evidence. Thus the purpose of a significance test is just to quantify the strength of evidence in the data against a hypothesis expressed in a (0, 1) scale, not to suggest an accept-reject rule (see item 7). There are some statisticians who think that the significance level used should depend on the sample size.

The problem with a preassigned significance level is that if the sample size is large enough, we can reject every null hypothesis. This is often the experience of those who use large cross-sectional data sets with thousands of observations. Almost every coefficient is significant at the 5% level. Lindley⁵ argues that for large samples one should use lower significance levels and for smaller samples higher significance levels. Leamer⁶ derives significance levels in the case of regression models for different sample sizes that show how significance levels should be much higher than 5% for small sample sizes and much lower than 5% for large sample sizes.

Very often the purpose of a test is to simplify an estimation procedure. This is the “pretesting” problem, where the test is a prelude to further estimation. In the case of such pretests it has been found that the significance levels to be used should be much higher than the conventional 5% (sometimes 25 to 50% and even 99%). The important thing to note is that tests of significance have several purposes and one should not use a uniform 5% significance level.⁷
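Two of the calculations in the points above can be sketched numerically. The Python code below is an illustration under stated assumptions, not a definitive implementation. First, it evaluates the P-value $\text{Prob}(t_{24} > 3)$ from item 5 by integrating the t density with Simpson's rule (the helper names `t_pdf` and `t_upper_tail` are illustrative). Second, it estimates $\alpha$ and $\beta$ by simulation for $H_0$: $\mu = 10$ against $H_1$: $\mu = 15$ from item 8; the text does not specify $n$, $\sigma$, or the cutoff $t^*$, so the values $n = 25$, $\sigma = 10$, and $t^* = 1.711$ (the tabulated upper 5% point of t with 24 degrees of freedom) are assumptions made only for this example.

```python
import math
import random

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t0, df, steps=2000):
    """Prob(t_df > t0) for t0 >= 0, as 0.5 minus a Simpson's-rule
    integral of the density over [0, t0] (steps must be even)."""
    h = t0 / steps
    total = t_pdf(0.0, df) + t_pdf(t0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Item 5: n = 25, sample mean 10, S = 5, hypothesized mean 7
n, ybar, s, mu0 = 25, 10.0, 5.0, 7.0
t0 = math.sqrt(n) * (ybar - mu0) / s
p_value = t_upper_tail(t0, n - 1)
print(f"t0 = {t0:.1f}, upper-tail P-value = {p_value:.4f}")

def rejection_rate(mu_true, mu0=10.0, sigma=10.0, n=25, t_star=1.711,
                   reps=4000, seed=2):
    """Share of simulated samples in which t exceeds the cutoff t_star.
    sigma, n, t_star, reps, and seed are assumed values for illustration."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        sample = [rng.gauss(mu_true, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        sd = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        if math.sqrt(n) * (xbar - mu0) / sd > t_star:
            rejections += 1
    return rejections / reps

alpha_hat = rejection_rate(10.0)      # rejecting H0 when mu = 10 (type I error)
beta_hat = 1 - rejection_rate(15.0)   # not rejecting H0 when mu = 15 (type II error)
print(f"estimated alpha = {alpha_hat:.3f}, estimated beta = {beta_hat:.3f}")
```

The P-value computed here is the upper-tail probability used in item 5; for a two-sided alternative one would double it. Rerunning `rejection_rate` with a larger assumed `n` shows $\beta$ falling while $\alpha$ stays near 5%, which is one way to see why a fixed significance level behaves differently in small and large samples.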

⁴ J. G. Kalbfleisch and D. A. Sprott, “On Tests of Significance,” in W. L. Harper and C. A. Hooker (eds.), Foundations of Probability Theory, Statistical Inference, and Statistical Theory of Science, Vol. 2 (Boston: D. Reidel, 1976), pp. 259-272.

⁵ D. V. Lindley, “A Statistical Paradox,” Biometrika, 1957, pp. 187-192.

⁶ Leamer, Specification Searches.

⁷ A good discussion of this problem is in the series of papers “For What Use Are Tests of Hypotheses and Tests of Significance” in Communications in Statistics, A. Theory and Methods, Vol. 5, No. 8, 1976.

2.10 Relationship Between Confidence
Interval Procedures and
Tests of Hypotheses


There is a close relationship between confidence intervals and tests of hypotheses. Suppose that we want to test a hypothesis at the 5% significance level. Then we can construct a 95% confidence interval for the parameter under consideration and see whether the hypothesized value is in the interval. If it is, we do not reject the hypothesis. If it is not, we reject the hypothesis. This relationship holds good for tests of parameter values. There are other tests, such as goodness-of-fit tests, tests of independence in contingency tables, and so on, for which there is no confidence interval counterpart.

As an example, consider a sample of 20 independent observations from a normal distribution with mean $\mu$ and variance $\sigma^2$. Suppose that the sample mean is $\bar{y} = 5$ and sample variance $S^2 = 9$. We saw from equation (2.5) that the 95% confidence interval was (3.6, 6.4). Suppose that we consider the problem of testing the hypothesis

$H_0$: $\mu = 7$ against $H_1$: $\mu \ne 7$

If we use a 5% level of significance we should reject $H_0$ if

$$\left|\frac{\sqrt{n}(\bar{y} - 7)}{S}\right| > 2.093$$

since 2.093 is the point from the t-tables for 19 degrees of freedom such that $\text{Prob}(-2.093 < t < 2.093) = 0.95$ or $\text{Prob}(|t| > 2.093) = 0.05$. In our example $\sqrt{n}(\bar{y} - 7)/S \approx -3$ and hence we reject $H_0$. Actually, we would be rejecting $H_0$ at the 5% significance level whenever $H_0$ specifies $\mu$ to be a value outside the interval (3.6, 6.4).

For a one-tailed test we have to consider the corresponding one-sided confidence interval. If $H_0$: $\mu = 7$ and $H_1$: $\mu < 7$, then we reject $H_0$ for low values of $t = \sqrt{n}(\bar{y} - 7)/S$. From the t-tables with 19 degrees of freedom, we find that $\text{Prob}(t < -1.73) = 0.05$. Hence we reject $H_0$ if the observed $t < -1.73$. In our example it is approximately $-3$ and hence we reject $H_0$. The corresponding 95% one-sided confidence interval is given by

$$\text{Prob}\left(-1.73 < \frac{\sqrt{n}(\bar{y} - \mu)}{S}\right) = 0.95$$

which gives (on substituting $n = 20$, $\bar{y} = 5$, $S = 3$) the confidence interval $(-\infty, 6.16)$.
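The correspondence between the two-sided test and the confidence interval can be verified directly for the example above. The following Python sketch is illustrative only; the function names are arbitrary choices, and the critical value 2.093 is taken from the text rather than computed. It checks, for several hypothesized means, that the 5% two-sided t-test rejects exactly when the hypothesized value lies outside the 95% interval.

```python
import math

T_CRIT = 2.093  # two-sided 5% point of t with 19 d.f., as quoted in the text

def t_stat(ybar, s, n, mu0):
    """t statistic for testing that the population mean equals mu0."""
    return math.sqrt(n) * (ybar - mu0) / s

def confidence_interval(ybar, s, n, t_crit=T_CRIT):
    """Two-sided interval ybar +/- t_crit * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return ybar - half_width, ybar + half_width

def reject(ybar, s, n, mu0, t_crit=T_CRIT):
    """True if the two-sided test rejects the hypothesized mean mu0."""
    return abs(t_stat(ybar, s, n, mu0)) > t_crit

n, ybar, s = 20, 5.0, 3.0
lower, upper = confidence_interval(ybar, s, n)

for mu0 in (3.0, 4.0, 5.0, 6.0, 7.0):
    outside = not (lower < mu0 < upper)
    # The test rejects exactly when mu0 falls outside the interval
    assert reject(ybar, s, n, mu0) == outside
    print(f"mu0 = {mu0}: t = {t_stat(ybar, s, n, mu0):6.2f}, reject = {outside}")
```

For the hypothesized value 7 the statistic is about -2.98, which the text rounds to -3, and the hypothesis is rejected because 7 lies outside (3.6, 6.4).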


Summary 


1. If two random variables are uncorrelated, this does not necessarily imply 
that they are independent. A simple example is given in Section 2.3. 

2. The normal distribution and the related distributions, t, $\chi^2$, and F, form the basis of all statistical inference in the book. Although many economic data do not necessarily satisfy the assumption of normality, we can make some transformations that would produce approximate normality. For instance, we consider log wages rather than wages.

3. The advantage of the normal distribution is that any linear function of normally distributed variables is normally distributed. For the $\chi^2$-distribution a weaker property holds. The sum of independent $\chi^2$ variables has a $\chi^2$-distribution. These properties are very useful in deriving probability distributions of sample statistics. The t, $\chi^2$, and F distributions are explained in Section 2.4.

4. A function of the sample observations is called a statistic (e.g., sample 
mean, sample variance). The probability distribution of a statistic is called the 
sampling distribution of the statistic. 

5. Classical statistical inference is based entirely on sampling distributions. By contrast, Bayesian inference makes use of sample information and prior information. We do not discuss Bayesian inference in this book, but the basic idea is explained in Section 2.5. Based on the prior distribution (which incorporates prior information) and the sample observations, we obtain what is known as the posterior distribution and all our inferences are based on this posterior distribution.

6. Classical statistical inference is usually discussed under three headings: point estimation, interval estimation, and testing of hypotheses. Three desirable properties of point estimators (unbiasedness, efficiency, and consistency) are discussed in Section 2.6.

7. There are three commonly used methods of deriving point estimators: 

(a) The method of moments. 

(b) The method of least squares. 

(c) The method of maximum likelihood. 

These are discussed in Chapter 3. 

8. Section 2.8 presents an introduction to interval estimation and Section 2.9 
gives an introduction to hypothesis testing. The interrelationship between the 
two is explained in Section 2.10. 

9. The main elements of hypothesis testing are discussed in detail in Section 
2.9. Most important, arguments are presented as to why it is not desirable to 
use the usual 5% significance level in all problems. 


Exercises 


1. The COMPACT computer company has 4 applicants, all equally qualified, of whom 2 are male and 2 female. The company has to choose 2 candidates, and it does not discriminate on the basis of sex. If it chooses the two candidates at random, what is the probability that the two candidates chosen will be the same sex? A student answered this question as follows: There are three possible outcomes: 2 female, 2 male, and 1 female and 1 male. The number of favorable outcomes is two. Hence the probability is 2/3. Is this correct?

2. Your friend John says: “Let’s toss coins. Each time I’ll toss first and then you. If either coin comes up heads, I win. If neither, you win.” You say, “There are three possibilities. You may get heads first and the game ends there. Or you may get tails and I get heads, or we may both get tails. In two of the three possible cases, you win.” “Right,” says your friend, “I have two chances to your one. So I’ll put up $2 against $1.” Is this a fair bet?

3. A major investment company wanted to know what proportions of investors bought stocks, bonds, both stocks and bonds, and neither stocks nor bonds. The company entrusted this task to its statistician, who, in turn, asked her assistant to conduct a telephone survey of 200 potential investors chosen at random. The assistant, however, was a moonlighter with some preconceptions of his own, and he cooked up the following results without actually doing a survey. For a sample of 200, he reported the following results:

Invested in stocks: 100

Invested in bonds: 75

Invested in both stocks and bonds: 45

Invested in neither: 90

The statistician saw these results and fired her moonlighting assistant. Why?

4. (The birthday problem) (In this famous problem it is easier to calculate the 
probability of a complementary event than the probability of the given 
event.) Suppose your friend, a teacher, says that she will pick 30 students 
at random from her school and offer you an even bet that there will be at 
least two students who have the same birthdays. Should you accept this 
bet? 

5. (Getting the answer without being sure you have asked the question) This 
is the randomized response technique due to Stanley L. Warner’s paper in 
Journal of the American Statistical Association, Vol. 60 (1965), pp. 63-69. 
You want to know the proportion of college students who have used drugs. 
A direct question will not give frank answers. Instead, you give the students
a box containing 4 blue, 3 red, and 4 white balls. Each student is
asked to draw a ball (you do not see the ball) and abide by the following 
instructions, based on the color of the ball drawn: 

Blue: Answer the question: “Have you used drugs?” 

Red: Answer “yes.” 

White: Answer “no.” 

If 40% of the students answered “yes,” what is the percentage of students 
who have used drugs? 

6. A petroleum company intends to drill for oil in four different wells. Use the subscripts 1, 2, 3, 4 to denote the different wells. Each well is given an equally likely chance of being successful. Consider the following events:

Event A: There are exactly two successful wells.

Event B: There are at least three successful wells.

Event C: Well number 3 is successful.

Event D: There are fewer than 3 successful wells.

Compute the following probabilities:

(a) P(A)      (b) P(B)      (c) P(C)      (d) P(D)
(e) P(AB)     (f) P(A + B)  (g) P(BC)     (h) P(B + C)
(i) P(BD)     (j) P(C)      (k) P(A|B)    (l) P(B|C)
(m) P(C|D)    (n) P(D|C)

7. On the TV show “Let’s Make a Deal,” one of the three boxes (A, B, C) on the stage contains keys to a Lincoln Continental. The other two boxes are empty. A contestant chooses box B. Boxes A and C remain on the table. Monty Hall, the energetic host of the show, suggests that the contestant surrender her box for $500. The contestant refuses. Monty Hall then opens one of the remaining boxes, box A, which turns out to be empty. Monty now offers $1000 to the contestant to surrender her box. She again refuses but asks whether she can trade her box B for box C on the table. Monty exclaims, “That’s weird.” But is it really weird, or does the contestant know how to calculate probabilities? Hint: Suppose that there are n boxes. The probability that the key is in any of the (n - 1) boxes on the stage = (n - 1)/n. Monty opens p boxes, which turn out to be empty. The probability that the key is in any of the (n - p - 1) boxes is (n - 1)/[n(n - p - 1)]. Without switching, the probability that the contestant wins remains at 1/n. Hence it is better to switch. In this case n = 3, p = 1. The probability of winning without a switch is 1/3. The probability of winning by switching is 2/3.

8. You are given 10 white and 10 black balls and two boxes. You are told that 
your instructor will draw a ball from one of the two boxes. If it is white, 
you pass the exam. If it is black, you fail. How should you arrange the 
balls in the boxes to maximize your chance of passing? 

9. Suppose that the joint probability distribution of X and Y is given by the following table.

            Y
X        2      4      6
1       0.2     0     0.2
2        0     0.2     0
3       0.2     0     0.2

(a) Are X and Y independent? Explain.

(b) Find the marginal distributions of X and Y.

(c) Find the conditional distribution of Y given X = 1 and hence E(Y | X = 1) and var(Y | X = 1).

(d) Repeat part (c) for X = 2 and X = 3 and hence verify the result that V(Y) = E[V(Y | X)] + V[E(Y | X)], that is, the variance of a random variable is equal to the expectation of its conditional variance plus the variance of its conditional expectation.

10. The density function of a continuous random variable X is given by

$$f(x) = \begin{cases} kx(2 - x) & \text{for } 0 \le x < 2 \\ 0 & \text{otherwise} \end{cases}$$

(a) Find k.

(b) Find E(X) and V(X).

11. Answer Exercise 10 when

$$f(x) = \begin{cases} kx & \text{for } 0 \le x \le 1 \\ k(2 - x) & \text{for } 1 < x < 2 \\ 0 & \text{otherwise} \end{cases}$$

12. Suppose that X and Y are continuous random variables with the joint probability density function

$$f(x, y) = \begin{cases} k(x + y) & \text{for } 0 \le x \le 1,\ 0 \le y \le 2 \\ 0 & \text{otherwise} \end{cases}$$

(a) Find k, E(X), E(Y), V(X), V(Y), and cov(X, Y). Are X and Y independent?

(b) Find the marginal densities of X and Y. 

(c) Find the conditional density of X given Y = { and hence E(X \ Y = 
i)and V(Y\X = §). 

13. Answer Exercise 12 if f(x, y) is defined as follows:

$$f(x, y) = \begin{cases} k(1 - x)(2 - y) & \text{for } 0 \le x < 1,\ 0 < y < 2 \\ 0 & \text{otherwise} \end{cases}$$

14. Let dependent random variables X, Y, and Z be defined by the joint distribution:

P(X = 1, Y = 2, Z = 3) = 0.25

P(X = 2, Y = 1, Z = 3) = 0.35

P(X = 2, Y = 3, Z = 1) = 0.40

In this case, P(X < Y) = 0.65, P(X < Z) = 0.6, and P(Y < Z) = 0.6, which shows that Z is the largest. However, direct observation shows that

P(X = minimum of X, Y, Z) = 0.25
P(Y = minimum of X, Y, Z) = 0.35
P(Z = minimum of X, Y, Z) = 0.40

which shows that Z is most likely to be the smallest. What can you conclude? (See C. R. Blyth, Journal of the American Statistical Association, Vol. 67, 1972, pp. 364-366 and 366-381.)

15. The number of hot dogs sold at a football stand has the following probability distribution:

X        Probability
800      0.05
900      0.10
1000     0.25
1100     0.35
1200     0.10
1300     0.10
1400     0.05

The hot dog vendor pays 30 cents for each hot dog and sells it for 45 cents. Thus for every hot dog sold, he makes a profit of 15 cents, and for every hot dog unsold, he loses 30 cents. What are the expected value and variance of his profit if the number of hot dogs he orders is: (a) 1100; (b) 1200; and (c) 1300? If he wants to maximize his expected profits, how many hot dogs should he order?

16. An instructor wishes to “grade on the curve.” The students’ scores seem 
to be normally distributed with mean 70 and standard deviation 8. If the 
instructor wishes to give 20 percent A’s, what should be the dividing line 
between an A grade and a B grade? 

17. Suppose that you replace every observation $x$ by $y = 3x + 7$ and the mean $\mu$ by $\eta = 3\mu + 7$. What happens to the t-value you use?

18. (Reading the N, t, $\chi^2$, F tables)

(a) Given that $X \sim N(2, 9)$, find $P(2 < X < 3)$.

(b) If $X \sim t_{20}$, find $x_1$ and $x_2$ such that

$P(X < x_1) = 0.95$
$P(-x_1 < X < x_1) = 0.90$
$P(x_1 < X < x_2) = 0.90$

Note that in the last case, we have several sets of $x_1$ and $x_2$. Find three sets.

(c) If $X \sim \chi^2_{10}$, find $x_1$ and $x_2$ such that $P(X < x_1) = 0.95$ and $P(X > x_2) = 0.95$. Find two values $x_1$ and $x_2$ such that $P(x_1 < X < x_2) = 0.90$.

Note again that we can have several sets of $x_1$ and $x_2$. Find three sets.

(d) If $X \sim F_{2,20}$, find $x_1$ such that $P(X > x_1) = 0.05$, and such that $P(X > x_1) = 0.01$. Can you find $x_1$ if you are told that $P(X > x_1) = 0.50$?

19. Given that $y = e^x$ is normal with mean 2 and variance 4, find the mean and variance of $x$.

20. Let $x_1, x_2, \ldots, x_n$ be a sample of size n from a normal distribution $N(\mu, \sigma^2)$. Consider the following point estimators of $\mu$:

$\hat\mu_1 = \bar{x}$, the sample mean

$\hat\mu_2 = x_1$

$\hat\mu_3 = \dfrac{1}{2}x_1 + \dfrac{1}{2(n - 1)}(x_2 + x_3 + \cdots + x_n)$

(a) Which of these are unbiased?

(b) Which of these are consistent?

(c) Find the relative efficiencies: $\hat\mu_1$ to $\hat\mu_2$, $\hat\mu_1$ to $\hat\mu_3$, and $\hat\mu_2$ to $\hat\mu_3$. What can you conclude from this?

(d) Are all unbiased estimators consistent?

(e) Is the assumption of normality needed to answer parts (a) to (d)? For what purposes is this assumption needed?

21. In Exercise 20 consider the point estimators of $\sigma^2$:

$\hat\sigma_1^2 = \dfrac{1}{n - 1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

$\hat\sigma_2^2 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$

$\hat\sigma_3^2 = (x_1 - \bar{x})^2$

Which of these estimators are

(a) Unbiased?

(b) Consistent?

Is the assumption of normality needed to answer this question? For what purposes is the normality assumption needed?

22. Suppose that $\hat\alpha$ is an estimator of $\alpha$ derived from a sample of size T. We are given that $E(\hat\alpha) = \alpha + 2/T$ and var$(\hat\alpha) = 4\alpha/T + \alpha^2/T^2$.

(a) Examine whether as an estimator of $\alpha$, $\hat\alpha$ is (1) unbiased, (2) consistent, (3) asymptotically unbiased, and (4) asymptotically efficient.

(b) What is the asymptotic variance of $\hat\alpha$?

23. Explain, using the estimators for $\mu$ in Exercise 20, the difference between lim E, AE, and plim. Give some examples of how they differ.

24. Examine whether the following statements are true or false. Explain your 
answer briefly. 

(a) The null hypothesis says that the effect is zero. 

(b) The alternative hypothesis says that nothing is going on besides 
chance variation. 

(c) A hypothesis test tells you whether or not you have a useful sample. 

(d) A significance level tells you how important the null hypothesis is. 

(e) The P-value will tell you at what level of significance you can reject the null hypothesis.

(f) Calculation of P-values is useless for significance tests.

(g) It is always better to use a significance level of 0.01 than a level of 0.05.

(h) If the P-value is 0.45, the null hypothesis looks plausible.

25. Examine whether the following statements are true or false. If false, correct the statement.

(a) With small samples and large $\sigma$, quite large differences may not be statistically significant but may be real and of great practical significance.

(b) The conclusions from the data cannot be summarized in the P-value. Conclusions should always have a practical meaning in terms of the problem at hand.

(c) The power of a test at the null hypothesis $H_0$ is equal to the significance level.

(d) In practice the alternative hypothesis $H_1$ has no important role except in deciding what the nature of the rejection region should be (left-sided, right-sided, or two-sided).

(e) If you are sufficiently resourceful, you can always reject any null hypothesis.

26. Define type I error, type II error, and power of a test. What is the relationship between type I error and the confidence coefficient in a confidence interval?

27. In each of the following cases, set up the null hypothesis and the alternative. Explain how you will proceed in testing the hypothesis.

(a) A biscuit manufacturer is packaging 16-oz. packages. The production manager feels that something is wrong with the packaging and that the packages contain too many biscuits.

(b) A tire manufacturer advertises that its tires last for at least 30,000 miles. A consumer group does not believe it.

(c) A manufacturer of weighing scales believes that something is going wrong in the production process and the scale does not show the correct weight.

In each of the problems above, can you identify the costs of mistaken decisions if we view the hypothesis-testing problem as a decision problem?

28. An examination of sample items from a shipment showed that 51% of the items were good and 49% were defective. The company president asked the statistician, “What is the probability that over half the items are good?” The statistician replied that the question cannot be answered from the data. Is this correct? Does the question make sense? Explain why.

29. A local merchant owns two grocery stores at opposite ends of town. He wants to determine if the variability in business is the same at both locations. Two independent random samples yield

n₁ = 16 days      n₂ = 16 days

s₁ = $200         s₂ = $300

(a) Is there enough evidence that the two stores have different variability in sales?

(b) The merchant reads in a trade magazine that stores similar to his have a population standard deviation in daily sales of $210 and that stores with higher variability are badly managed. Is there any evidence to suggest that either one of the two stores the merchant owns has a standard deviation of sales greater than $210?

30. A stockbroker who wants to compare mean returns and risk (measured by 
variance) of two stocks and gets the following results: 


First stock 
71, = 31 
x, = 0.45 
s, = 0.60 


Second stock 
n 2 = 31 
* 2 = 0.35 
s 2 = 0.40 


Are there any significant differences in the mean returns and risks? (As¬ 
sume that daily price changes are normally distributed.) 

31. If p has a uniform distribution in the range (0, 1), show that \( -2 \log_e p \) has
a \( \chi^2 \)-distribution with degrees of freedom 2. If there are k independent tests,
each with a p-value \( p_i \), then

\[ \lambda = -2 \sum_{i=1}^{k} \log_e p_i \]

has a \( \chi^2 \)-distribution with d.f. 2k. This statistic can be used for an overall
rejection or acceptance of the null hypothesis based on the k independent tests.
[See G. S. Maddala, Econometrics (New York: McGraw-Hill, 1977), p. 48. The test
is from C. R. Rao, Advanced Statistical Methods in Biometric Research (New York:
Wiley, 1952), p. 44.]
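
As a rough illustration of how the statistic in Exercise 31 might be computed, the sketch below uses only the Python standard library. The p-values are hypothetical, and the function names are ours. The tail probability uses the known closed form for a chi-square variable with an even number of degrees of freedom, P(chi-square with 2k d.f. > x) = exp(-x/2) times the sum of (x/2)^j / j! for j = 0, ..., k - 1.

```python
import math

def combined_statistic(p_values):
    """Return lambda = -2 * sum(log p_i) for independent p-values."""
    return -2.0 * sum(math.log(p) for p in p_values)

def chi2_upper_tail_even_df(x, k):
    """P(chi-square with 2k d.f. > x), closed form valid for even d.f."""
    half = x / 2.0
    term, total = 1.0, 1.0          # j = 0 term of the sum
    for j in range(1, k):
        term *= half / j            # builds (x/2)^j / j! recursively
        total += term
    return math.exp(-half) * total

p_values = [0.08, 0.12, 0.05]       # hypothetical p-values from k = 3 tests
lam = combined_statistic(p_values)
overall_p = chi2_upper_tail_even_df(lam, len(p_values))
print(round(lam, 3), round(overall_p, 4))
```

With a single test (k = 1) the combined p-value reduces to the original p-value, which is a convenient check on the formula.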

32. The weekly cash inflows (x) and outflows ( y ) of a business firm are random 
variables. The following data give values of x and y for 30 weeks. Assume 
that x and y are normally distributed. 


 x    y      x    y      x    y
42   25     70   39     93   20
65   37     82   36     86   68
76   83     90   82     68   72
92   36     68   30     53   60
37   73     82   72     87   65
47   23     28   39     63   80
27   97     61   27     47   62
23   36     75   38     52   36
63   70     83   27     38   43
40   51     60   78     90   57


(a) Obtain unbiased estimates of the means and variances of x, y, and
x - y. Also obtain 95% confidence intervals for these six quantities.

(b) Test the hypothesis \( \mu_x = \mu_y \) assuming that:

(1) x and y are independent. 

(2) x and y are correlated. 


Appendix to Chapter 2 

Matrix Algebra 

In this appendix we present an introduction to matrix notation. This will
facilitate the exposition of some material in later chapters, in particular,
Chapter 4.

A matrix is a rectangular array of elements; for example,

\[ \begin{bmatrix} 2 & 1 & 7 & 4 \\ 1 & 2 & -2 & 1 \\ 4 & 1 & 2 & -3 \end{bmatrix} \]

We shall denote this by A and say that it is of order 3 x 4. The first number is
the number of rows, and the second number is the number of columns. A matrix
of order 1 x n is called a row vector, and a matrix of order m x 1 is called a
column vector; for example, b = (1, 2, 7, 3) is a row vector and

\[ c = \begin{bmatrix} 2 \\ 8 \\ -10 \end{bmatrix} \]

is a column vector. Henceforth we shall follow the convention of writing all
column vectors without a prime and writing row vectors with a prime; for
example, if

\[ b = \begin{bmatrix} 1 \\ 2 \\ 7 \\ 3 \end{bmatrix} \]

is a column vector, b' = [1, 2, 7, 3] is a row vector.


The transpose of a matrix A, denoted by A', is the same matrix A with rows
and columns interchanged. In the example above,

\[ A' = \begin{bmatrix} 2 & 1 & 4 \\ 1 & 2 & 1 \\ 7 & -2 & 2 \\ 4 & 1 & -3 \end{bmatrix} \]


We shall now define matrix addition, subtraction, and multiplication. Matrix
addition (or subtraction) is done by adding (subtracting) the corresponding
elements and is defined only if the matrices are of the same order. If they are not
of the same order, then there are no corresponding elements. For example, if

\[ A = \begin{bmatrix} 1 & 2 & 7 \\ -1 & 0 & 6 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 3 & 4 & 2 \\ -3 & -7 & 10 \end{bmatrix} \]

then

\[ A + B = \begin{bmatrix} 4 & 6 & 9 \\ -4 & -7 & 16 \end{bmatrix} \quad \text{and} \quad A - B = \begin{bmatrix} -2 & -2 & 5 \\ 2 & 7 & -4 \end{bmatrix} \]

Obviously, A + B' is not defined because A is of order 2 x 3 and B' is of order
3 x 2. Also note that A + B = B + A. As we shall see later, this commutative
law does not hold for matrix multiplication; that is, AB and BA need not be
defined, and they need not be equal even if they are defined.


Multiplication of a Matrix by a Scalar 

This is done by multiplying each element of the matrix by the scalar. For
instance, if

\[ A = \begin{bmatrix} 2 & 3 & 1 \\ 0 & 1 & 7 \end{bmatrix} \]

then

\[ 4A = \begin{bmatrix} 8 & 12 & 4 \\ 0 & 4 & 28 \end{bmatrix} \]


Matrix Equality 

Two matrices A and B are said to be equal if they are of the same order and
they have all the corresponding elements equal. In this case, A - B = 0 (a
matrix with all elements zero; such a matrix is known as a null matrix).

Scalar Product of Vectors 

If b and c are two vectors of the same order, so that \( b' = (b_1, b_2, \ldots, b_n) \) and
\( c' = (c_1, c_2, \ldots, c_n) \), the scalar product of the two vectors is defined as

\[ b'c = b_1 c_1 + b_2 c_2 + \cdots + b_n c_n \]

The multiplication is row-column wise and is achieved by multiplying the
corresponding elements and adding up the result. For example, if b' = (2, -1, 2)
and c' = (0, 3, 3), then b'c = (2)(0) + (-1)(3) + (2)(3) = 0 - 3 + 6 = 3. The
scalar product is not defined if the vectors are not of the same order.

Matrix Multiplication 

A matrix can be considered as a series of row vectors or a series of column
vectors. Matrix multiplication is also done row-column wise. If we have two
matrices B and C and we need the product BC, we write B as a series of row
vectors and C as a series of column vectors and then take scalar products of
the row vectors in B and the column vectors in C. Clearly, for this to be possible
the number of elements in each row of B should be equal to the number of
elements in each column of C. If B is of order m x n and C is of order n x k, BC is
defined because the number of elements in the row vectors in B and the column
vectors in C are both n. But if B is n x k and C is m x n, BC is not defined
(unless k = m). If B is an m x n matrix and C is an n x k matrix, we can write B
as a set of m row vectors and C as a set of k column vectors, each of order n.
That is,

\[ B = \begin{bmatrix} b_1' \\ b_2' \\ \vdots \\ b_m' \end{bmatrix} \quad \text{and} \quad C = [c_1, c_2, \ldots, c_k] \]

Then BC is defined as

\[ BC = \begin{bmatrix} b_1'c_1 & b_1'c_2 & \cdots & b_1'c_k \\ b_2'c_1 & b_2'c_2 & \cdots & b_2'c_k \\ \vdots & \vdots & & \vdots \\ b_m'c_1 & b_m'c_2 & \cdots & b_m'c_k \end{bmatrix} \]


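BC is of order m x k. It has as many rows as B and as many columns as C. As
an example, consider

\[ B = \begin{bmatrix} 2 & 1 & 3 \\ 0 & 1 & 2 \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} 7 & 5 \\ 8 & 6 \end{bmatrix} \]

B is of order 2 x 3 and C of order 2 x 2. Hence BC is not defined. But B'C is
defined:

\[ B'C = \begin{bmatrix} (2)(7) + (0)(8) & (2)(5) + (0)(6) \\ (1)(7) + (1)(8) & (1)(5) + (1)(6) \\ (3)(7) + (2)(8) & (3)(5) + (2)(6) \end{bmatrix} = \begin{bmatrix} 14 & 10 \\ 15 & 11 \\ 37 & 27 \end{bmatrix} \]

Note that BC is not defined but CB is defined:

\[ CB = \begin{bmatrix} 7 & 5 \\ 8 & 6 \end{bmatrix} \begin{bmatrix} 2 & 1 & 3 \\ 0 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 14 & 12 & 31 \\ 16 & 14 & 36 \end{bmatrix} \]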


Note: Given two matrices, B and C, one or both of the products BC and CB
may not be defined, and even if they are both defined, they may not be of the
same order, and even if they are of the same order, they may not be equal. For
instance, if B is 2 x 3 and C is 3 x 2, then BC is defined and of order 2 x 2,
and CB is also defined but is of order 3 x 3. Suppose that

\[ B = \begin{bmatrix} 2 & 0 \\ 1 & 1 \end{bmatrix} \quad \text{and} \quad C = \begin{bmatrix} 3 & 6 \\ 1 & 3 \end{bmatrix} \]

then BC and CB are both of order 2 x 2, but they are not equal. We have

\[ BC = \begin{bmatrix} 6 & 12 \\ 4 & 9 \end{bmatrix} \quad \text{and} \quad CB = \begin{bmatrix} 12 & 6 \\ 5 & 3 \end{bmatrix} \]

Because of all these complications, when we say that a matrix B is multiplied
by a matrix C, we have to specify whether B is premultiplied by C so that we
have CB or whether it is postmultiplied by C so that we have BC.
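
The row-column rule and the distinction between pre- and postmultiplication can be sketched in a few lines of Python. This is only an illustration: the function name matmul is ours, matrices are stored as lists of rows, and the entries of B and C are the 2 x 2 matrices consistent with the products BC and CB quoted above.

```python
def matmul(P, Q):
    """Multiply matrices stored as lists of rows; fail if the orders do not conform."""
    if len(P[0]) != len(Q):
        raise ValueError("product not defined: columns of P != rows of Q")
    # Element (i, j) is the scalar product of row i of P and column j of Q.
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]

B = [[2, 0], [1, 1]]
C = [[3, 6], [1, 3]]
BC = matmul(B, C)   # B postmultiplied by C
CB = matmul(C, B)   # B premultiplied by C
print(BC, CB, BC == CB)
```

Calling matmul with the 2 x 3 matrix B and the 2 x 2 matrix C of the earlier example raises an error, which mirrors the statement that BC is not defined there.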


Reversal Law for Transpose of a Product 

If B and C are matrices such that BC is defined, then (BC)' = C'B'. This result 
can easily be verified and hence we shall not prove it. 


Identity Matrix 

An n x n matrix with 1 in the diagonal and zeros elsewhere is called an identity
matrix of order n and is denoted by \( I_n \). For example,

\[ I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

The identity matrix plays the same role in matrix multiplication as the number
1 in scalar multiplication, except that the order of \( I_n \) must be defined properly
for pre- and postmultiplication; for example, if B is 3 x 4, then \( I_3 B = B \) and
\( B I_4 = B \).


Inverse of a Matrix 

The inverse of a square matrix A, denoted by \( A^{-1} \), is a matrix such that
\( A^{-1}A = AA^{-1} = I \). This is analogous to the result in ordinary scalar
multiplication \( yy^{-1} = y^{-1}y = 1 \). To find \( A^{-1} \) given A, we need to go
through a few results on determinants.

Determinants 

Corresponding to each square matrix A there is a scalar value known as the
determinant, which is denoted by |A|. There are \( n^2 \) elements in A, and the
determinant of A is an algebraic function of these \( n^2 \) elements. Formally, the
definition is

\[ |A| = \sum (\pm)\, a_{1i}\, a_{2j} \cdots a_{nk} \]

where the second subscripts (i, j, ..., k) are a permutation of the numbers
(1, 2, ..., n). The summation is over all the (n!) permutations, and the sign of
any term is positive if it is an even permutation and negative if it is an odd
permutation. The permutation is odd (even) if we need an odd (even) number of
interchanges in the elements to arrive at the given permutation. For example, if
we have a 3 x 3 matrix,

\[ A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \]

we have to consider permutations of the numbers (1, 2, 3). The permutation
(3, 2, 1) is odd because we need one interchange. The permutation (3, 1, 2) is
even because we need two interchanges. There are (3!) = 6 permutations in
all. These are, with the appropriate signs: +(1, 2, 3), -(1, 3, 2), -(2, 1, 3),
+(2, 3, 1), -(3, 2, 1), +(3, 1, 2). Hence we have

\[ |A| = a_{11}a_{22}a_{33} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33} + a_{12}a_{23}a_{31} - a_{13}a_{22}a_{31} + a_{13}a_{21}a_{32} \]

Note that the first subscripts are always (1, 2, 3). We can write this in terms of
the elements of the first row as follows:

\[ |A| = a_{11}(a_{22}a_{33} - a_{23}a_{32}) + a_{12}(-a_{21}a_{33} + a_{23}a_{31}) + a_{13}(a_{21}a_{32} - a_{22}a_{31}) \]

The terms in parentheses are called the cofactors of the respective elements.
Thus if we denote the cofactor of \( a_{11} \) by \( A_{11} \), of \( a_{12} \) by \( A_{12} \), and of
\( a_{13} \) by \( A_{13} \), we have

\[ |A| = a_{11}A_{11} + a_{12}A_{12} + a_{13}A_{13} \]

We can as well write |A| in terms of the elements and the corresponding
cofactors of any other row or column. For example, if we consider the third
column, we can write

\[ |A| = a_{13}A_{13} + a_{23}A_{23} + a_{33}A_{33} \]


The cofactor is nothing but the determinant of a submatrix obtained by deleting
that row and column, with an appropriate sign. For example, if we want to get
the cofactor for \( a_{13} \), we delete the first row and third column, calculate the
value of the determinant of the 2 x 2 matrix, and multiply it by \( (-1)^{1+3} = 1 \).

To get the cofactor of \( a_{ij} \) in a matrix A, we delete the ith row and jth column in
A, compute the determinant, and multiply it by \( (-1)^{i+j} \). For a 2 x 2 matrix

\[ A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \]

we have \( |A| = a_{11}a_{22} - a_{12}a_{21} \).
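
The permutation definition of the determinant can be sketched directly in Python. The function names are ours, and the sign of a permutation is found by counting inversions, which gives the same parity as counting interchanges. The two numerical matrices are ones that are evaluated elsewhere in this appendix.

```python
from itertools import permutations

def permutation_sign(perm):
    """+1 for an even permutation, -1 for an odd one (parity of inversions)."""
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def det_by_permutations(A):
    """Sum of signed products a[0][p0] * a[1][p1] * ... over all permutations p."""
    n = len(A)
    total = 0
    for perm in permutations(range(n)):
        product = 1
        for row, col in enumerate(perm):   # first subscripts are always 0, 1, ..., n-1
            product *= A[row][col]
        total += permutation_sign(perm) * product
    return total

print(det_by_permutations([[2, 8, 16], [3, 4, 10], [-1, 2, 3]]))
print(det_by_permutations([[3, -1, 1], [4, 2, 8], [1, 2, 3]]))
```

Since the sum has n! terms, this is useful only as a check on small matrices; for larger ones the cofactor expansion and the simplification rules that follow are preferable.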


Properties of Determinants 

There are several useful properties of determinants that can be derived by just 
considering the definition of a determinant. 

1. If A* is a matrix obtained by interchanging any two rows or columns of
A, then |A*| = -|A|. This is because all odd permutations become even and
even permutations become odd, with one extra interchange. For example,

\[ \begin{vmatrix} 1 & 3 & 7 \\ 2 & 0 & 1 \\ 5 & 1 & 8 \end{vmatrix} = - \begin{vmatrix} 2 & 0 & 1 \\ 1 & 3 & 7 \\ 5 & 1 & 8 \end{vmatrix} \]

2. From property 1 it follows that if two rows (or columns) of a matrix A are
identical, then |A| = 0 because by interchanging these rows (or columns) the
value is unaltered. Thus |A| = |A*| = -|A|. Hence |A| = 0. For example,

\[ \begin{vmatrix} 1 & 3 & 7 \\ 1 & 3 & 7 \\ 5 & 1 & 8 \end{vmatrix} = 0 \]

3. Expansion of a determinant by “alien” cofactors is equal to zero. By
“alien” cofactors, we mean cofactors of another row or column. For example,
consider the 3 x 3 matrix. The first row is \( (a_{11}, a_{12}, a_{13}) \) with cofactors
\( (A_{11}, A_{12}, A_{13}) \). The second row is \( (a_{21}, a_{22}, a_{23}) \) with cofactors
\( (A_{21}, A_{22}, A_{23}) \). Then

\[ a_{11}A_{11} + a_{12}A_{12} + a_{13}A_{13} = \text{expansion by own cofactors} \]

\[ a_{11}A_{21} + a_{12}A_{22} + a_{13}A_{23} = \text{expansion by alien cofactors} \]

We know that the first expansion is |A|. The second expansion would be a
correct expansion if \( a_{11} = a_{21} \), \( a_{12} = a_{22} \), and \( a_{13} = a_{23} \), that is, if the
first and second rows of A are identical. But we know from property 2 that in
this case |A| = 0. Hence we have

\[ a_{11}A_{21} + a_{12}A_{22} + a_{13}A_{23} = 0 \]

or expansion by alien cofactors is zero.

4. The value of a determinant is unaltered by adding to any of its rows (or
columns) any multiples of other rows (or columns). For example, consider
adding three times the second row to the first row for a 3 x 3 matrix. We then
have, expanding by the elements and cofactors of the first row,

\[ \begin{vmatrix} a_{11} + 3a_{21} & a_{12} + 3a_{22} & a_{13} + 3a_{23} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = (a_{11} + 3a_{21})A_{11} + (a_{12} + 3a_{22})A_{12} + (a_{13} + 3a_{23})A_{13} \]

\[ = (a_{11}A_{11} + a_{12}A_{12} + a_{13}A_{13}) + 3(a_{21}A_{11} + a_{22}A_{12} + a_{23}A_{13}) \]

The first term in parentheses is |A|, and the second is zero, because it is an
expansion by alien cofactors.

This property is useful for evaluating the values of determinants. For
instance, consider

\[ |A| = \begin{vmatrix} 2 & 8 & 16 \\ 3 & 4 & 10 \\ -1 & 2 & 3 \end{vmatrix} \]

Now consider (row 1) - (row 2) - 2(row 3). The value of the determinant is
unaltered. But we get

\[ \begin{vmatrix} 1 & 0 & 0 \\ 3 & 4 & 10 \\ -1 & 2 & 3 \end{vmatrix} \]

Now expand by the elements of the first row. It is easy because we have two
zeros. We get

\[ |A| = 1 \cdot \begin{vmatrix} 4 & 10 \\ 2 & 3 \end{vmatrix} = 12 - 20 = -8 \]

5. From property 4 we can deduce the important property that if a row (or 
column) of a matrix A can be expressed as a linear combination of the other 
rows (or columns), the determinant of A is zero. For example, consider 



Here the third column of A = (first column) + 2(second column). In such cases 
we say that the columns are linearly dependent. When we evaluate |A| we can 
subtract from column 3, (column 1 + 2 column 2) and get all zeros in column 
3. Then the expansion by the elements of column 3 gives us |A| = 0. From this 
property we can derive the following theorem: 

Theorem: The determinant of a matrix A is nonzero if and only if
there are no linear dependencies between the columns (or rows) of
A. If |A| ≠ 0, A is said to be nonsingular. If |A| = 0, A is said to be
singular.

Note that each row (or column) of a matrix is a vector. A set of vectors is said
to be linearly independent if none can be expressed as a linear combination of
the rest. For example, (1, -1, 0), (2, 3, 1), and (4, 1, 1) are not linearly
independent because the third vector = 2(first vector) + 1(second vector). There
cannot be more than n linearly independent vectors of order n.

Determinants of the Third Order 

There is a simple procedure for evaluating determinants of the third order. For 
higher-order determinants we have to follow the expansion by cofactors and 
the simplification rule 4. 

Consider 

\[ \begin{vmatrix} 4 & 3 & 3 \\ 3 & 1 & 4 \\ 3 & 4 & 1 \end{vmatrix} \]

What we do is append the first two columns to this and take products of the
diagonal elements. There are three products going down and three products
going up. The products going down are positive; the products going up are
negative.

\[ \begin{matrix} 4 & 3 & 3 & 4 & 3 \\ 3 & 1 & 4 & 3 & 1 \\ 3 & 4 & 1 & 3 & 4 \end{matrix} \]

Products going up: 9 + 64 + 9 = 82
Products going down: 4 + 36 + 36 = 76

The value of the determinant is 76 - 82 = -6.

As another example, consider

\[ \begin{vmatrix} 3 & -1 & 1 \\ 4 & 2 & 8 \\ 1 & 2 & 3 \end{vmatrix} \]

We have

\[ \begin{matrix} 3 & -1 & 1 & 3 & -1 \\ 4 & 2 & 8 & 4 & 2 \\ 1 & 2 & 3 & 1 & 2 \end{matrix} \]

Products going up: 2 + 48 - 12 = 38
Products going down: 18 - 8 + 8 = 18

The value of the determinant is 18 - 38 = -20.
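
The procedure of appending the first two columns and taking diagonal products can be sketched as follows. The function name sarrus is ours (the procedure is commonly called the rule of Sarrus), and the sketch applies to 3 x 3 matrices only. Column indices wrap around with the modulo operator, which plays the role of the two appended columns.

```python
def sarrus(A):
    """Return (sum of products going down, sum going up, determinant) for a 3 x 3 matrix."""
    # Diagonals going down start in row 0 and move one column right per row.
    down = sum(A[0][j] * A[1][(j + 1) % 3] * A[2][(j + 2) % 3] for j in range(3))
    # Diagonals going up start in row 2 and move one column right per row.
    up = sum(A[2][j] * A[1][(j + 1) % 3] * A[0][(j + 2) % 3] for j in range(3))
    return down, up, down - up

print(sarrus([[4, 3, 3], [3, 1, 4], [3, 4, 1]]))
print(sarrus([[3, -1, 1], [4, 2, 8], [1, 2, 3]]))
```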


Finding the Inverse of a Matrix 

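To get the inverse of a matrix A, we first replace each element of A by its
cofactor, transpose the matrix, and then divide each element by |A|. For
instance, with the 3 x 3 matrix A that we have been considering,

\[ A^{-1} = \frac{1}{|A|} \begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}' \]

Noting that expansion by own cofactors gives |A| and expansion by alien
cofactors gives zero, we get

\[ A^{-1}A = \frac{1}{|A|} \begin{bmatrix} A_{11} & A_{21} & A_{31} \\ A_{12} & A_{22} & A_{32} \\ A_{13} & A_{23} & A_{33} \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \frac{1}{|A|} \begin{bmatrix} |A| & 0 & 0 \\ 0 & |A| & 0 \\ 0 & 0 & |A| \end{bmatrix} = I \]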

We can also check that \( AA^{-1} = I \). If |A| = 0, then \( A^{-1} \) does not exist. Thus
for a singular matrix, the inverse does not exist. As an example of computing
an inverse, consider the matrix

\[ A = \begin{bmatrix} 3 & -1 & 1 \\ 4 & 2 & 8 \\ 1 & 2 & 3 \end{bmatrix} \]

To find the inverse, first we have to find the matrix of cofactors and also |A|.
The matrix of cofactors is

\[ \begin{bmatrix} -10 & -4 & 6 \\ 5 & 8 & -7 \\ -10 & -20 & 10 \end{bmatrix} \]

and |A| = -20. Hence

\[ A^{-1} = -\frac{1}{20} \begin{bmatrix} -10 & 5 & -10 \\ -4 & 8 & -20 \\ 6 & -7 & 10 \end{bmatrix} \]
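
The cofactor method for the inverse can be sketched in Python using exact fractions, so that the check that the inverse times A equals I involves no rounding. The function names are ours, and the determinant is computed by expansion along the first row, which is adequate only for small matrices.

```python
from fractions import Fraction

def minor(A, i, j):
    """Submatrix of A with row i and column j deleted."""
    return [[A[r][c] for c in range(len(A)) if c != j]
            for r in range(len(A)) if r != i]

def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j)) for j in range(len(A)))

def cofactor_matrix(A):
    n = len(A)
    return [[(-1) ** (i + j) * det(minor(A, i, j)) for j in range(n)]
            for i in range(n)]

def inverse(A):
    """Transpose of the cofactor matrix divided by |A|."""
    d = det(A)
    if d == 0:
        raise ValueError("singular matrix: the inverse does not exist")
    cof = cofactor_matrix(A)
    n = len(A)
    return [[Fraction(cof[j][i], d) for j in range(n)] for i in range(n)]

A = [[3, -1, 1], [4, 2, 8], [1, 2, 3]]
A_inv = inverse(A)
check = [[sum(A_inv[i][t] * A[t][j] for t in range(3)) for j in range(3)]
         for i in range(3)]
print(cofactor_matrix(A), det(A), check)
```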


Reversal Law 

Inverses follow a reversal law, just like transposes. If B and C both have
inverses, then

\[ (BC)^{-1} = C^{-1}B^{-1} \]


Orthogonal Matrices 

Two vectors \( b_1 \) and \( b_2 \) are said to be orthogonal if \( b_1'b_2 = 0 \). They are also of
unit length if \( b_1'b_1 = b_2'b_2 = 1 \). A matrix B is said to be an orthogonal matrix
if its columns \( b_1, b_2, \ldots, b_n \) are orthogonal and of unit length. We have

\[ B'B = \begin{bmatrix} b_1' \\ b_2' \\ \vdots \\ b_n' \end{bmatrix} [b_1, b_2, \ldots, b_n] = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} = I, \text{ the identity matrix} \]

Postmultiplying both sides by \( B^{-1} \), we get

\[ B' = B^{-1} \]

Thus for an orthogonal matrix, the inverse is just the transpose. Premultiplying
both sides by B, we get BB' = I. Thus for an orthogonal matrix the rows as
well as columns are orthogonal. Also, |B'B| = 1. Hence |B| = ±1.


Rank of a Matrix 

A matrix of order m x n can be regarded as a set of m row vectors or a set of
n column vectors. The row rank of the matrix is the number of linearly
independent row vectors. The column rank of the matrix is the number of linearly
independent column vectors. The row rank is ≤ m and the column rank is ≤ n.
There are three important results on the rank of a matrix (which we shall
state without proof).

1. Row rank = column rank. 

2. If A and B are two matrices such that their product AB is defined, rank 
(AB) is not greater than rank A or rank B. 

3. The rank of a matrix is unaltered by pre- or postmultiplication by a
nonsingular matrix. This says that rank is unaltered by taking linear
combinations of rows (or columns).

Solution of Linear Equations

For simplicity let us consider a system of three equations in three unknowns:

\[ a_{11}x_1 + a_{12}x_2 + a_{13}x_3 = b_1 \]

\[ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 = b_2 \]

\[ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 = b_3 \]

We can write them compactly in matrix notation as Ax = b, where

\[ A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \]

If b = 0, the system of equations is said to be homogeneous. In this case we 
have Ax = 0. This says that the vector x is orthogonal to the row vectors of A. 
But since A is 3 x 3 and there can be at most three linearly independent vectors 
of order 3, a necessary and sufficient condition for the existence of a nonzero 
solution to the set of equations is rank A < 3. In the general case of n equations, 
the condition is rank A < n. If rank A = r < n, there are (n - r ) linearly 
independent solutions x that satisfy the equations Ax = 0. 

Now consider the case of the nonhomogeneous equations: Ax = b. We can
write these equations as follows:

\[ x_1 \begin{bmatrix} a_{11} \\ a_{21} \\ a_{31} \end{bmatrix} + x_2 \begin{bmatrix} a_{12} \\ a_{22} \\ a_{32} \end{bmatrix} + x_3 \begin{bmatrix} a_{13} \\ a_{23} \\ a_{33} \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \]

What this says is that b is a linear combination of the columns of A. Hence a
necessary and sufficient condition for the existence of a solution to this set is:
Rank(A) = Rank(A|b).


Cramer’s Rule 

There is one convenient rule for solving a system of nonhomogeneous equations.
This rule, known as Cramer’s rule, is as follows. Consider the system of
nonhomogeneous equations Ax = b where \( x' = (x_1, x_2, \ldots, x_n) \). Let us denote
by \( A_1 \) the matrix A with the first column replaced by the vector b. Similarly,
\( A_2 \) denotes the matrix A with the second column replaced by the vector b. We
define \( A_3, A_4, \ldots, A_n \) similarly. Then Cramer’s rule says:

\[ x_1 = \frac{|A_1|}{|A|}, \qquad x_2 = \frac{|A_2|}{|A|}, \qquad x_3 = \frac{|A_3|}{|A|} \]

and so on. As an example, consider the system of equations:

\[ 4x_1 + 3x_2 + x_3 = 13 \]

\[ x_1 - x_2 + x_3 = 2 \]

\[ 2x_1 - x_2 + 3x_3 = 9 \]

Then

\[ |A| = \begin{vmatrix} 4 & 3 & 1 \\ 1 & -1 & 1 \\ 2 & -1 & 3 \end{vmatrix} = -10 \qquad |A_1| = \begin{vmatrix} 13 & 3 & 1 \\ 2 & -1 & 1 \\ 9 & -1 & 3 \end{vmatrix} = -10 \]

\[ |A_2| = \begin{vmatrix} 4 & 13 & 1 \\ 1 & 2 & 1 \\ 2 & 9 & 3 \end{vmatrix} = -20 \qquad |A_3| = \begin{vmatrix} 4 & 3 & 13 \\ 1 & -1 & 2 \\ 2 & -1 & 9 \end{vmatrix} = -30 \]

Hence \( x_1 = -10/-10 = 1 \), \( x_2 = -20/-10 = 2 \), \( x_3 = -30/-10 = 3 \).
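
Cramer's rule can be sketched in Python as follows, using the three-equation example above. The function names are ours, and the determinant is again computed by cofactor expansion along the first row, so the sketch is meant for small systems only.

```python
from fractions import Fraction

def det(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j in range(len(M)):
        sub = [row[:j] + row[j + 1:] for row in M[1:]]   # delete row 0 and column j
        total += (-1) ** j * M[0][j] * det(sub)
    return total

def cramer(A, b):
    """Solve Ax = b by replacing each column of A in turn with b."""
    d = det(A)
    if d == 0:
        raise ValueError("|A| = 0: Cramer's rule breaks down")
    solution = []
    for j in range(len(A)):
        A_j = [row[:j] + [b[i]] + row[j + 1:] for i, row in enumerate(A)]
        solution.append(Fraction(det(A_j), d))
    return solution

A = [[4, 3, 1], [1, -1, 1], [2, -1, 3]]
b = [13, 2, 9]
x = cramer(A, b)
print(det(A), x)
```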

Another example we shall consider is that of the existence of a solution for 
which we require Rank (A) = Rank (A|b). Consider the question: For what 
value of c will the following equations admit a solution? (Note that this is a 
case where Cramer’s rule breaks down because |A| = 0.) 

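\[ 2x_1 - x_2 + 5x_3 = 4 \]

\[ 4x_1 + 6x_3 = 1 \]

\[ -2x_2 + 4x_3 = 7 + c \]

Note that what we need to show is

\[ \text{Rank} \begin{bmatrix} 2 & -1 & 5 \\ 4 & 0 & 6 \\ 0 & -2 & 4 \end{bmatrix} = \text{Rank} \begin{bmatrix} 2 & -1 & 5 & 4 \\ 4 & 0 & 6 & 1 \\ 0 & -2 & 4 & 7 + c \end{bmatrix} \]

Since rank is unaltered by taking linear combinations of rows (or columns),
subtract 2(row 1) from row 2. We get

\[ \begin{bmatrix} 2 & -1 & 5 & 4 \\ 0 & 2 & -4 & -7 \\ 0 & -2 & 4 & 7 + c \end{bmatrix} \]

Now add row 2 to row 3. We get

\[ \begin{bmatrix} 2 & -1 & 5 & 4 \\ 0 & 2 & -4 & -7 \\ 0 & 0 & 0 & c \end{bmatrix} \]

Now

\[ \text{Rank}(A) = \text{Rank} \begin{bmatrix} 2 & -1 & 5 \\ 0 & 2 & -4 \\ 0 & 0 & 0 \end{bmatrix} = 2 \]

because the third row has all zeros.

\[ \text{Rank}(A|b) = \text{Rank} \begin{bmatrix} 2 & -1 & 5 & 4 \\ 0 & 2 & -4 & -7 \\ 0 & 0 & 0 & c \end{bmatrix} = 2 \quad \text{only if } c = 0 \]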


Hence the required answer is c = 0. 
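
The rank condition can be checked mechanically by row reduction. The sketch below is one simple way of doing it, with exact fractions and function names of our own choosing; it counts the pivots found while reducing the matrix to echelon form, which is the same idea as the row operations used above.

```python
from fractions import Fraction

def rank(M):
    """Number of pivots found when reducing M to row echelon form."""
    rows = [[Fraction(v) for v in row] for row in M]
    r = 0                                   # index of the next pivot row
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue                        # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * p for a, p in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def has_solution(c):
    """True when Rank(A) = Rank(A|b) for the system in the text."""
    A = [[2, -1, 5], [4, 0, 6], [0, -2, 4]]
    b = [4, 1, 7 + c]
    augmented = [row + [bi] for row, bi in zip(A, b)]
    return rank(A) == rank(augmented)

print(has_solution(0), has_solution(1))
```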


Linear and Quadratic Forms 

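Suppose that we have the vectors a, x, and the matrix A defined as

\[ a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix} \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} \qquad A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \]

Then \( L = a'x = a_1x_1 + a_2x_2 + a_3x_3 \) is said to be a linear form in the x’s, and

\[ Q = x'Ax = a_{11}x_1^2 + a_{12}x_1x_2 + a_{13}x_1x_3 + a_{21}x_1x_2 + a_{22}x_2^2 + a_{23}x_2x_3 + a_{31}x_1x_3 + a_{32}x_2x_3 + a_{33}x_3^2 \]

is called a quadratic form in the x’s. The generalization to the case of n x’s is
obvious. In subsequent chapters we shall need differentiation of the linear
function L and the quadratic function Q with respect to the x’s.

Note that \( \partial L/\partial x_1 = a_1 \), \( \partial L/\partial x_2 = a_2 \), and \( \partial L/\partial x_3 = a_3 \). We shall denote the
vector of partial derivatives

\[ \begin{bmatrix} \partial L/\partial x_1 \\ \partial L/\partial x_2 \\ \partial L/\partial x_3 \end{bmatrix} \]

by \( \partial L/\partial x \). Thus we have \( \partial L/\partial x = a \). Also,

\[ \frac{\partial Q}{\partial x_1} = (a_{11}x_1 + a_{12}x_2 + a_{13}x_3) + (a_{11}x_1 + a_{21}x_2 + a_{31}x_3) \]

with similar expressions for \( \partial Q/\partial x_2 \) and \( \partial Q/\partial x_3 \). Collecting these and writing in
matrix notation, we get

\[ \frac{\partial Q}{\partial x} = Ax + A'x = 2Ax \quad \text{if A is symmetric} \]

Thus we get

\[ \frac{\partial}{\partial x}(a'x) = a \quad \text{and} \quad \frac{\partial}{\partial x}(x'Ax) = (A + A')x \]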
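
The derivative rule for a quadratic form can be checked numerically. The sketch below compares (A + A')x with central finite differences of x'Ax. The matrix A and the point x are hypothetical values chosen only for illustration, A is deliberately not symmetric, and the function names are ours.

```python
def quad_form(A, x):
    """Q = x'Ax written out as a double sum."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def analytic_gradient(A, x):
    """(A + A')x, the vector of partial derivatives of x'Ax."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]

def numeric_gradient(A, x, h=1e-5):
    """Central-difference approximation to each partial derivative of Q."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((quad_form(A, up) - quad_form(A, down)) / (2 * h))
    return grad

A = [[2.0, 1.0, 0.0], [3.0, 4.0, -1.0], [5.0, 2.0, 1.0]]   # hypothetical, not symmetric
x = [1.0, -2.0, 0.5]                                        # hypothetical point
print(analytic_gradient(A, x))
print(numeric_gradient(A, x))
```

The two vectors agree up to rounding error, since the central difference is exact for a quadratic function apart from floating-point effects.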

We shall use this result in the Appendix to Chapter 4.

Covariance Matrix of a Set of Random Variables

Let \( x' = (x_1, x_2, \ldots, x_n) \) be a set of n independent random variables with mean
zero and common variance \( \sigma^2 \). Earlier we defined x'x as a scalar product. This
is equal to \( \sum_{i=1}^{n} x_i^2 \). If we consider xx', this will be an n x n matrix. The
covariance matrix of the variables is (since their mean is 0)

\[ V = E(xx') = E \begin{bmatrix} x_1^2 & x_1x_2 & \cdots & x_1x_n \\ x_1x_2 & x_2^2 & \cdots & x_2x_n \\ \vdots & \vdots & & \vdots \\ x_1x_n & x_2x_n & \cdots & x_n^2 \end{bmatrix} \]

Since

\[ E(x_ix_j) = \begin{cases} \sigma^2 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \]

we have \( V = I\sigma^2 \). In the general case where \( E(x_i) = \mu_i \) and \( \text{cov}(x_i, x_j) = \sigma_{ij} \), we
have \( E(x) = \mu \), where \( \mu \) is the vector of means, and the covariance matrix is
\( V = E(x - \mu)(x - \mu)' = \Sigma \), an n x n matrix whose (i, j)th element is \( \sigma_{ij} \).


Positive Definite and Negative Definite Matrices 

In the case of scalars, a number y is said to be positive if y > 0, negative if
y < 0, nonnegative if y ≥ 0, and nonpositive if y ≤ 0. In the case of matrices,
the corresponding concepts are positive definite, negative definite, positive
semidefinite, and negative semidefinite, respectively. Corresponding to a
square matrix B we define the quadratic form Q = x'Bx, where x is a nonnull
vector. Then:

B is said to be positive definite if Q > 0.

B is said to be positive semidefinite if Q ≥ 0.

B is said to be negative definite if Q < 0.

B is said to be negative semidefinite if Q ≤ 0.

All these relations should hold for any nonnull vector x.

For a positive definite matrix B, leading (diagonal) determinants of all orders
are > 0. In particular, the diagonal elements are > 0 and |B| > 0. For example,
consider the matrix

\[ B = \begin{bmatrix} 3 & 1 & -3 \\ -4 & 2 & 2 \\ 6 & -4 & 7 \end{bmatrix} \]

The diagonal elements are 3, 2, and 7, all positive. The diagonal determinants
of order 2 are

\[ \begin{vmatrix} 3 & 1 \\ -4 & 2 \end{vmatrix} = 10, \qquad \begin{vmatrix} 3 & -3 \\ 6 & 7 \end{vmatrix} = 39, \qquad \text{and} \qquad \begin{vmatrix} 2 & 2 \\ -4 & 7 \end{vmatrix} = 22 \]

which are all positive. Also, |B| = 94 > 0. Hence B is positive definite.

As yet another example, consider the quadratic form

\[ Q = 4x_1^2 + 9x_2^2 + 2x_3^2 + 6x_1x_2 + 6x_1x_3 + 8x_2x_3 \]

Is this positive for all values of \( x_1 \), \( x_2 \), and \( x_3 \)? To answer this question, we write
Q = x'Bx. The matrix B is given by

\[ B = \begin{bmatrix} 4 & 3 & 3 \\ 3 & 9 & 4 \\ 3 & 4 & 2 \end{bmatrix} \]

The diagonal terms are all positive. As for the three leading determinants of
order 2, they are

\[ \begin{vmatrix} 4 & 3 \\ 3 & 9 \end{vmatrix}, \qquad \begin{vmatrix} 9 & 4 \\ 4 & 2 \end{vmatrix}, \qquad \text{and} \qquad \begin{vmatrix} 4 & 3 \\ 3 & 2 \end{vmatrix} \]

The first two are positive, but the last one is not. Also, |B| = -19. Hence B is
not positive definite. The answer to the question asked is “no.”
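
For the symmetric matrix of the last example, the check can be sketched in Python. Note the swap of technique: instead of examining every diagonal determinant as in the text, the sketch uses Sylvester's criterion, which for a symmetric matrix requires only the leading principal minors (the determinants of the top-left 1 x 1, 2 x 2, ..., n x n blocks) to be positive. The function names and the test vector x = (2, 0, -3) are ours.

```python
def det(M):
    """Determinant by cofactor expansion along the first row."""
    if len(M) == 1:
        return M[0][0]
    total = 0
    for j in range(len(M)):
        sub = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(sub)
    return total

def leading_minors(B):
    """Determinants of the top-left k x k blocks, k = 1, ..., n."""
    return [det([row[:k] for row in B[:k]]) for k in range(1, len(B) + 1)]

def is_positive_definite(B):
    """Sylvester's criterion; assumes B is symmetric."""
    return all(m > 0 for m in leading_minors(B))

def quad_form(B, x):
    n = len(x)
    return sum(x[i] * B[i][j] * x[j] for i in range(n) for j in range(n))

B = [[4, 3, 3], [3, 9, 4], [3, 4, 2]]
print(leading_minors(B), is_positive_definite(B))
print(quad_form(B, [2, 0, -3]))   # a vector for which Q is negative
```

Exhibiting a single nonnull x with Q < 0 is by itself enough to show that B is not positive definite.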

For a negative definite matrix all leading determinants of odd order are < 0
and all leading determinants of even order are > 0. For semidefinite matrices
we replace > 0 by ≥ 0 and < 0 by ≤ 0. For example, the matrix


3 1 -3 

-4 0 2 

6-4 7 


is positive semidefinite. Suppose that A is an m x n matrix. Then A'A and AA' 
are square matrices of orders n x n and m x m, respectively. We can show 
that both these matrices are positive semidefinite. 

Consider B = AA'. Then x'Bx = x'AA'x. Define y = A'x. Then \( x'Bx = y'y
= \sum y_i^2 \), which is ≥ 0. Hence x'Bx ≥ 0; that is, B is positive semidefinite.

Finally, consider two positive semidefinite matrices B and C. We shall write
B ≥ C if B - C is positive semidefinite. Consider


Then 



2 

1 

1 


b - a ' a - D d 

are both positive definite. 


and 


C = AA' = 


5 2 
2 1 
4 1 


4 
1 

5 


The Multivariate Normal Distribution 

Let \( x' = (x_1, x_2, \ldots, x_n) \) be a set of n variables which are normally distributed
with mean vector \( \mu \) and covariance matrix V. Then x is said to have an
n-variate normal distribution \( N_n(\mu, V) \). Its density function is given by

\[ f(x) = \frac{1}{(2\pi)^{n/2}\,|V|^{1/2}} \exp\left[ -\tfrac{1}{2}(x - \mu)'V^{-1}(x - \mu) \right] \]

Note that with n = 1, we have \( V = \sigma^2 \) and we can see the analogy with the
density function for the univariate normal we considered earlier.

Consider the simpler case \( \mu = 0 \). Also make the transformation

\[ y = V^{-1/2}x \]

Then the y’s are linear functions of the x’s and hence have a normal distribution.
The covariance matrix of the y’s is

\[ E(yy') = E[V^{-1/2}xx'V^{-1/2}] = V^{-1/2}E(xx')V^{-1/2} = V^{-1/2}VV^{-1/2} = I \]

Thus the y’s are independent normal with mean 0 and variance 1. Hence \( \sum y_i^2 \)
has a \( \chi^2 \)-distribution with degrees of freedom n. But

\[ \sum y_i^2 = y'y = x'V^{-1/2}V^{-1/2}x = x'V^{-1}x \]

Hence we have the result

\[ \text{If } x \sim N_n(0, V), \text{ then } x'V^{-1}x \sim \chi^2_n \]


Idempotent Matrices and the χ²-Distribution

A matrix A is said to be idempotent if \( A^2 = A \). For example, consider
\( A = X(X'X)^{-1}X' \). Then \( A^2 = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = A \). Thus A is
idempotent. Such matrices play an important role in econometrics. We shall
state two important theorems regarding the relationship between idempotent
matrices and the \( \chi^2 \)-distribution (proofs are omitted).⁸

Let \( x' = (x_1, x_2, \ldots, x_n) \) be a set of independent normal variables with mean
0 and variance 1. We know that \( x'x = \sum x_i^2 \) has a \( \chi^2 \)-distribution with degrees
of freedom (d.f.) n. But some other quadratic forms also have a \( \chi^2 \)-distribution,
as stated in the following theorems.

Theorem 1: If A is an n x n idempotent matrix of rank r, then x'Ax
has a \( \chi^2 \)-distribution with d.f. r.

Theorem 2: If \( A_1 \) and \( A_2 \) are two idempotent matrices with ranks \( r_1 \)
and \( r_2 \), respectively, and \( A_1A_2 = 0 \), then \( x'A_1x \) and \( x'A_2x \)
have independent \( \chi^2 \)-distributions with d.f. \( r_1 \) and \( r_2 \),
respectively.

We shall be using this result in statistical inference in the multiple regression 
model in Chapter 4. 

Trace of a Matrix 

The trace of a matrix A is the sum of its diagonal elements. We denote this by
Tr(A). These are a few important results regarding traces.

Tr(A + B) = Tr(A) + Tr(B)

Tr(AB) = Tr(BA) if both AB and BA are defined.

⁸ For proofs, see G. S. Maddala, Econometrics (New York: McGraw-Hill, 1977), pp. 455-456.

These results can be checked by writing down the appropriate expressions. 
As an example, let 


Then 





Tr(AB) = Tr(BA) = 7. Another important result we shall be using (in the case
of the multiple regression model) is the following:⁹


If A is an idempotent matrix, then Rank(A) = Tr(A) 
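
These results can be illustrated with a small Python sketch. To keep it free of matrix inversion we assume that X has a single column, so that X'X is a scalar and its inverse is just a reciprocal. The data x = (1, 2, 3) and the matrices P and Q are hypothetical, and the function names are ours.

```python
from fractions import Fraction

def matmul(P, Q):
    """Row-column product of matrices stored as lists of rows."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q)))
             for j in range(len(Q[0]))]
            for i in range(len(P))]

def trace(M):
    """Sum of the diagonal elements of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))

x = [1, 2, 3]                          # hypothetical single regressor, X is 3 x 1
xtx = sum(v * v for v in x)            # X'X is the scalar 14
# A = X (X'X)^{-1} X' has (i, j) element x_i * x_j / (X'X).
A = [[Fraction(xi * xj, xtx) for xj in x] for xi in x]

print(matmul(A, A) == A)               # idempotent: A squared equals A
print(trace(A))                        # equals the rank of A, which is 1 here

P = [[1, 2, 0], [3, -1, 4]]            # hypothetical 2 x 3 matrix
Q = [[2, 1], [0, 5], [-1, 3]]          # hypothetical 3 x 2 matrix
print(trace(matmul(P, Q)), trace(matmul(Q, P)))
```

Every row of this A is a multiple of x', so its rank is 1, in agreement with its trace. PQ is 2 x 2 and QP is 3 x 3, yet the two traces coincide.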


Characteristic Roots and Vectors 

This topic is discussed in the Appendix to Chapter 7.


Exercises on Matrix Algebra 

1. r. n 

0 

i ij’ 


Consider A = 


B 


1 0 
0 I 
1 0 


, and C 


2 1 
1 1 


(a) Compute ABC, CAB, BCA, CB'A', C'B'A'. 

(b) Verify that (ABC)' = C'B'A'. 

(c) Find the inverses of these matrices. Verify that \( (ABC)^{-1} =
C^{-1}(AB)^{-1} \).

(d) Verify that Tr(BCA) = Tr(ABC) = Tr(CAB). 

2. Solve the following set of equations using matrix methods.

\[ x_1 + 2x_2 + 2x_3 = 1 \]

\[ 2x_1 + 2x_2 + 3x_3 = 3 \]

\[ x_1 - x_2 + 3x_3 = 5 \]

3. Determine those values of \( \lambda \) for which the following set of equations may
possess a nontrivial solution.

\[ 3x_1 + x_2 - \lambda x_3 = 0 \]

\[ 4x_1 - 2x_2 - 3x_3 = 0 \]

\[ 2\lambda x_1 + 4x_2 + \lambda x_3 = 0 \]

For each permissible value of \( \lambda \), determine the most general solution.


⁹ For a proof, see Maddala, Econometrics, pp. 444-445.

4. If AB = AC, where A is a square matrix, when does it necessarily follow
that B = C? Give an example in which this does not follow.

5. A and B are symmetric matrices. Show that AB is also symmetric if and 
only if A and B are commutative. 

6. If A is a square matrix, the matrix obtained by replacing each element of
A by its cofactor and then transposing the resulting matrix is known as the
adjoint of A and is denoted by Adj(A). Note that \( A^{-1} = (1/|A|)\,\text{Adj}(A) \). Show
that if A and B are square matrices, Adj(AB) = Adj(B) x Adj(A).

7. In Exercise 1, show that \( A(A'A)^{-1}A' \) and \( B(B'B)^{-1}B' \) are both idempotent.
What are the ranks of these two matrices?

8. Determine whether the quadratic form \( Q = x_1^2 + 2x_2^2 + x_3^2 - 2x_1x_2 + 2x_2x_3 \)
is positive definite or not. Answer the same for

\[ Q = 2x_1^2 + 3x_2^2 + x_3^2 + x_4^2 + 2x_1x_2 - 2x_1x_3 + 8x_2x_3 + 4x_2x_4 + 4x_3x_4 \]

9. (a) Show that the set of equations

\[ 2x_1 - 2x_2 + x_3 = \lambda x_1 \]

\[ 2x_1 - 3x_2 + 2x_3 = \lambda x_2 \]

\[ -x_1 + 2x_2 = \lambda x_3 \]

can possess a nontrivial solution only if \( \lambda = 1 \) or \( -3 \).

(b) Obtain the general solution in each case. 

10. Construct a set of three mutually orthogonal vectors that are linear
combinations of the vectors (1, 1, 0, 1), (1, 1, 0, 0), and (1, 0, 2, 2).

3.1 Introduction


Regression analysis is one of the most commonly used tools in econometric
work. We will, therefore, start our discussion with an outline of regression
analysis. The subsequent chapters will deal with some modifications and
extensions of this basic technique that need to be made when analyzing economic
data.

We start with a basic question: What is regression analysis? Regression
analysis is concerned with describing and evaluating the relationship between a
given variable (often called the explained or dependent variable) and one or
more other variables (often called the explanatory or independent variables).
We will denote the explained variable by y and the explanatory variables by
\( x_1, x_2, \ldots, x_k \).

The dictionary definition of “regression” is “backward movement, a retreat, 
a return to an earlier stage of development.” Paradoxical as it may sound, 
regression analysis as it is currently used has nothing to do with regression as 
dictionaries define the term. 

The term regression was coined by Sir Francis Galton (1822-1911) from England,
who was studying the relationship between the height of children and the
height of parents. He observed that although tall parents had tall children and 
short parents had short children, there was a tendency for children’s heights to 
converge toward the average. There is thus a “regression of children’s height 
toward the average.” Galton, in his aristocratic way, termed this a “regression 
toward mediocrity.” 

Something similar to what Galton found has been noted in some other studies 
as well (studies of test scores, etc.). These examples are discussed in Section 
3.12 under the heading “regression fallacy.” For the present we should note 
that regression analysis as currently used has nothing to do with regression or 
backward movement. 

Let us return to our notation, with the explained variable denoted by y
and the explanatory variables denoted by $x_1, x_2, \ldots, x_k$. If k = 1, that is, there is
only one x variable, we have what is known as simple regression. This
is what we discuss in this chapter. If k > 1, that is, there is more than one x
variable, we have what is known as multiple regression. This is discussed in
Chapter 4. First we give some examples.

Example 1: Simple Regression 
y = sales 

x = advertising expenditures 

Here we try to determine the relationship between sales and advertising expenditures.

Example 2: Multiple Regression

y = consumption expenditures of a family
$x_1$ = family income
$x_2$ = financial assets of the family
$x_3$ = family size

Here we try to determine the relationship between consumption expenditures
on the one hand and family income, financial assets of the family, and family
size on the other.

There are several objectives in studying these relationships. They can be 
used to: 

1. Analyze the effects of policies that involve changing the individual x’s. 

In Example 1 this involves analyzing the effect of changing advertising 
expenditures on sales. 

2. Forecast the value of y for a given set of x’s. 

3. Examine whether any of the x’s have a significant effect on y. 

In Example 2 we would want to know whether family size has a significant 
effect on consumption expenditures of the family. The exact meaning of the 
word “significant” is discussed in Section 3.5. 

The way we have set up the problem until now, the variable y and the x variables
are not on the same footing. Implicitly we have assumed that the x’s are
variables that influence y or are variables that we can control or change and y
is the effect variable. There are several alternative terms used in the literature
for y and $x_1, x_2, \ldots, x_k$. These are shown in Table 3.1.

Each of these terms is relevant for a particular view of the use of regression 
analysis. Terminology (a) is used if the purpose is prediction. For instance, 
sales is the predictand and advertising expenditures is the predictor. The terminology
in (b), (c), and (d) is used by different people in their discussion of
regression models. They are all equivalent terms. Terminology (e) is used in 
studies of causation. Terminology (f) is specific to econometrics. We use this 
terminology in Chapter 9. Finally, terminology (g) is used in control problems. 
For instance, our objective might be to achieve a certain level of sales (target 
variable) and we would like to determine the level of advertising expenditures 
(control variable) to achieve our objective. 

In this and subsequent chapters we use the terminology in (c) and (d). Also,
we consider here the case of one explained (dependent) variable and one explanatory
(independent) variable. This, as we said earlier, is called simple
regression. Further, as we said earlier, the variables y and x are not treated
on the same footing. A detailed discussion of this issue is postponed to Chapter 11.

Table 3.1 Classification of Variables in Regression Analysis

       y                      $x_1, x_2, \ldots, x_k$
(a)    Predictand             Predictors
(b)    Regressand             Regressors
(c)    Explained variable     Explanatory variables
(d)    Dependent variable     Independent variables
(e)    Effect variable        Causal variables
(f)    Endogenous variable    Exogenous variables
(g)    Target variable        Control variables


3.2 Specification of the Relationships 


As mentioned in Section 3.1, we will discuss the case of one explained (dependent)
variable, which we denote by y, and one explanatory (independent) variable,
which we denote by x. The relationship between y and x is denoted by

$$y = f(x) \tag{3.1}$$


where f(x) is a function of x. 

At this point we need to distinguish between two types of relationships: 

1. A deterministic or mathematical relationship. 

2. A statistical relationship which does not give unique values of y for given 
values of x but can be described exactly in probabilistic terms. 

What we are going to talk about in regression analysis here is relationships 
of type 2, not of type 1. As an example, suppose that the relationship between 
sales y and advertising expenditures x is 

$$y = 2500 + 100x - x^2$$

This is a deterministic relationship. The sales for different levels of advertising
expenditures can be determined exactly. These are as follows:

    x       y
    0     2500
   20     4100
   50     5000
  100     2500

On the other hand, suppose that the relationship between sales y and advertising
expenditures x is

$$y = 2500 + 100x - x^2 + u$$

where u = +500 with probability 1/2
        = -500 with probability 1/2

Then the values of y for different values of x cannot be determined exactly but
can be described probabilistically. For example, if advertising expenditures are
50, sales will be 5500 with probability 1/2 and 4500 with probability 1/2.

The values of y for different values of x are now as follows:

    x          y
    0     2000 or 3000
   20     3600 or 4600
   50     4500 or 5500
  100     2000 or 3000

The data on y that we observe can be any one of the 16 possible cases (two possible
values of y at each of the four values of x). For instance, we can have

    x       y
    0     2000
   20     4600
   50     5500
  100     2000

If the error term u has a continuous distribution, say a normal distribution
with mean 0 and variance 1, then for each value of x we have a normal distribution
for y and the value of y we observe can be any observation from this
distribution. For instance, if the relationship between y and x is

$$y = 2 + x + u$$

where the error term u is N(0, 1), then for each value of x, y will have a normal
distribution. This is shown in Figure 3.1. The line we have drawn is the deterministic
relationship y = 2 + x. The actual values of y for each x will be some
points on the vertical lines shown. The relationship between y and x in such
cases is called a stochastic or statistical relationship.

Figure 3.1. A stochastic relationship.
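The stochastic relationship y = 2 + x + u with u distributed N(0, 1) can be illustrated by simulation. The following is a minimal sketch, not part of the original exposition; the function name `draw_y`, the seed, and the number of draws are our own illustrative choices. For each x it draws many values of y and compares their average with the deterministic part 2 + x.

```python
import random

random.seed(0)  # fixed seed (an arbitrary choice) so the sketch is reproducible


def draw_y(x, n_draws):
    """Draw n_draws values of y = 2 + x + u, where u ~ N(0, 1)."""
    return [2 + x + random.gauss(0.0, 1.0) for _ in range(n_draws)]


for x in (1, 2, 3):
    ys = draw_y(x, 10_000)
    mean_y = sum(ys) / len(ys)
    # Individual draws scatter around the line, but their average should be
    # close to the deterministic component 2 + x.
    print(f"x = {x}: deterministic part = {2 + x}, average of simulated y = {mean_y:.3f}")
```

Any single draw of y can fall above or below the line; only the distribution of y at each x is pinned down, which is what is meant by a statistical relationship.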

Going back to equation (3.1), we will assume that the function f(x) is linear
in x, that is,

$$f(x) = \alpha + \beta x$$

and we will assume that this relationship is a stochastic relationship, that is,

$$y = \alpha + \beta x + u \tag{3.2}$$

where u, which is called an error or disturbance, has a known probability distribution
(i.e., is a random variable).

In equation (3.2), $\alpha + \beta x$ is the deterministic component of y and u is the
stochastic or random component. $\alpha$ and $\beta$ are called regression coefficients or
regression parameters that we estimate from the data on y and x.

There is no reason why the deterministic and stochastic components must 
be additive. But we will start our discussion with a simple model and introduce 
complications later. For this reason we have taken f(x) to be a linear function 
and have assumed an additive error. Some simple alternative functional forms 
are discussed in Section 3.8. 

Why should we add an error term u? What are the sources of the error term
u in equation (3.2)? There are three main sources:

1. Unpredictable element of randomness in human responses. For instance, 
if y = consumption expenditure of a household and x = disposable income
of the household, there is an unpredictable element of randomness
in each household’s consumption. The household does not behave like a 
machine. In one month the people in the household are on a spending 
spree. In another month they are tightfisted. 

2. Effect of a large number of omitted variables. Again in our example x is 
not the only variable influencing y. The family size, tastes of the family, 
spending habits, and so on, affect the variable y. The error u is a catchall 
for the effects of all these variables, some of which may not even be 
quantifiable, and some of which may not even be identifiable. To a certain 
extent some of these variables are those that we refer to in source 1. 

3. Measurement error in y. In our example this refers to measurement error 
in the household consumption. That is, we cannot measure it accurately. 
This argument for u is somewhat difficult to justify, particularly if we say 
that there is no measurement error in x (household disposable income). 
The case where both y and x are measured with error is discussed in 
Chapter 11. Since we have to go step by step and not introduce all the 
complications initially, we will accept this argument; that is, there is a 
measurement error in y but not in x.

In summary, the sources of the error term u are:

1. Unpredictable element of randomness in human response. 

2. Effect of a large number of variables that have been omitted. 

3. Measurement error in y. 

If we have n observations on y and x, we can write equation (3.2) as

$$y_i = \alpha + \beta x_i + u_i, \qquad i = 1, 2, \ldots, n \tag{3.3}$$

Our objective is to get estimates of the unknown parameters $\alpha$ and $\beta$ in equation
(3.3) given the n observations on y and x. To do this we have to make some
assumptions about the error terms $u_i$. The assumptions we make are:

1. Zero mean. $E(u_i) = 0$ for all i.

2. Common variance. $\mathrm{var}(u_i) = \sigma^2$ for all i.

3. Independence. $u_i$ and $u_j$ are independent for all $i \neq j$.

4. Independence of $x_j$. $u_i$ and $x_j$ are independent for all i and j. This assumption
automatically follows if the $x_j$ are considered nonrandom variables. With
reference to Figure 3.1, what this says is that the distribution of u does
not depend on the value of x.

5. Normality. The $u_i$ are normally distributed for all i. In conjunction with assumptions
1, 2, and 3 this implies that the $u_i$ are independently and normally
distributed with mean zero and a common variance $\sigma^2$. We write this as
$u_i \sim IN(0, \sigma^2)$.

These are the assumptions with which we start. We will, however, relax 
some of these assumptions in later chapters. 

Assumption 2 is relaxed in Chapter 5. 

Assumption 3 is relaxed in Chapter 6. 

Assumption 4 is relaxed in Chapter 9. 

As for the normality assumption, we retain it because we will make inferences
on the basis of the normal distribution and the t and F distributions. The
first assumption is also retained throughout.

Since $E(u_i) = 0$ we can write (3.3) as

$$E(y_i) = \alpha + \beta x_i \tag{3.4}$$

This is also often termed the population regression function. When we substitute
estimates of the parameters $\alpha$ and $\beta$ in this, we get the sample regression
function.

We will discuss three methods for estimating the parameters $\alpha$ and $\beta$.

1. The method of moments. 

2. The method of least squares. 

3. The method of maximum likelihood. 

The first two methods are discussed in the next two sections. The last method 
is discussed in the appendix to this chapter. In the case of the simple regression
model we are considering, all three methods give identical estimates. When it
comes to generalizations, the methods give different estimates.


3.3 The Method of Moments 


The assumptions we have made about the error term u imply that

$$E(u) = 0 \quad \text{and} \quad \mathrm{cov}(x, u) = 0$$

In the method of moments, we replace these conditions by their sample counterparts.

Let $\hat{\alpha}$ and $\hat{\beta}$ be the estimators for $\alpha$ and $\beta$, respectively. The sample counterpart
of $u_i$ is the estimated error $\hat{u}_i$ (which is also called the residual), defined
as

$$\hat{u}_i = y_i - \hat{\alpha} - \hat{\beta} x_i$$

The two equations to determine $\hat{\alpha}$ and $\hat{\beta}$ are obtained by replacing population
assumptions by their sample counterparts.

Population Assumption        Sample Counterpart
$E(u) = 0$                   $\frac{1}{n}\sum \hat{u}_i = 0$  or  $\sum \hat{u}_i = 0$
$\mathrm{cov}(x, u) = 0$     $\frac{1}{n}\sum x_i \hat{u}_i = 0$  or  $\sum x_i \hat{u}_i = 0$

In these and the following equations, $\sum$ denotes $\sum_{i=1}^{n}$. Thus we get the two
equations

$$\sum \hat{u}_i = 0 \quad \text{or} \quad \sum (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0$$
$$\sum x_i \hat{u}_i = 0 \quad \text{or} \quad \sum x_i (y_i - \hat{\alpha} - \hat{\beta} x_i) = 0$$

These equations can be written as (noting that $\sum \hat{\alpha} = n\hat{\alpha}$)

$$\sum y_i = n\hat{\alpha} + \hat{\beta} \sum x_i$$
$$\sum x_i y_i = \hat{\alpha} \sum x_i + \hat{\beta} \sum x_i^2$$

Solving these two equations, we get $\hat{\alpha}$ and $\hat{\beta}$. These equations are also called
“normal equations.” In Section 3.4, we will show further simplifications of the
equations as well as methods of making statistical inferences. First, we consider
an illustrative example.


Illustrative Example 

Consider the data on advertising expenditures (x) and sales revenue (y) for an
athletic sportswear store for 5 months. The observations are as follows:

Month    Sales Revenue, y          Advertising Expenditure, x
         (thousands of dollars)    (hundreds of dollars)
  1               3                           1
  2               4                           2
  3               2                           3
  4               6                           4
  5               8                           5

To get $\hat{\alpha}$ and $\hat{\beta}$, we need to compute $\sum x$, $\sum x^2$, $\sum xy$, and $\sum y$. We have

Observation     x      y     $x^2$     xy     $\hat{u}_i$
     1          1      3       1        3       0.8
     2          2      4       4        8       0.6
     3          3      2       9        6      -2.6
     4          4      6      16       24       0.2
     5          5      8      25       40       1.0
   Total       15     23      55       81       0


The normal equations are (since n = 5)

$$5\hat{\alpha} + 15\hat{\beta} = 23$$
$$15\hat{\alpha} + 55\hat{\beta} = 81$$

These give $\hat{\alpha} = 1.0$, $\hat{\beta} = 1.2$. Thus the sample regression equation is

$$\hat{y} = 1.0 + 1.2x$$
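As a check on the arithmetic, the two normal equations for the sportswear data can be solved numerically. The sketch below is only illustrative: it forms the four sums from the data table and solves the 2 x 2 system by Cramer's rule (our choice of solution method; any method of solving two linear equations would do). The variable names are our own.

```python
# Sportswear example: x = advertising (hundreds of dollars),
# y = sales revenue (thousands of dollars).
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 6, 8]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Normal equations:
#   n * a     + sum_x  * b = sum_y
#   sum_x * a + sum_xx * b = sum_xy
# Solve for (a, b) by Cramer's rule.
det = n * sum_xx - sum_x * sum_x
alpha_hat = (sum_y * sum_xx - sum_x * sum_xy) / det
beta_hat = (n * sum_xy - sum_x * sum_y) / det

print("sums:", sum_x, sum_y, sum_xx, sum_xy)
print("alpha_hat =", alpha_hat, "beta_hat =", beta_hat)
```

The sums reproduce the Total row of the table (15, 23, 55, 81), and the solution agrees with the estimates 1.0 and 1.2 given above.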

The sample observations and the estimated regression line are shown in Figure 3.2.

The intercept 1.0 gives the value of y when x = 0. This says that if advertising
expenditures are zero, sales revenue will be $1000. The slope coefficient is
1.2. This says that if x is changed by an amount Δx, the change in y is Δy =
1.2Δx. For example, if advertising expenditures are increased by 1 unit ($100),
sales revenue increases by 1.2 units ($1200) on the average. Clearly, there is
no certainty in this prediction. The estimates 1.0 and 1.2 have some uncertainty
attached. We discuss this in Section 3.5, on statistical inference in the linear
regression model. For the present, what we have shown is how to obtain estimates
of the parameters α and β. One other thing to note is that it is not appropriate
to obtain predictions too far from the range of observations. Otherwise,
the owner of the store might conclude that by raising the advertising
expenditures by $100,000, she can increase sales revenue by $1,200,000.

We have also shown the estimated errors or the residuals, which are given
by

$$\hat{u}_i = y_i - 1.0 - 1.2x_i$$

These are the errors we would make if we tried to predict the value of y on the
basis of the estimated regression equation and the values of x. Note that we
are not trying to predict any future values of y. We are considering only within-sample
prediction errors. We cannot get $\hat{u}_i$ for $x_i = 0$ or $x_i = 6$ because we do
not have the corresponding values of $y_i$. Out-of-sample prediction is considered
in Section 3.6.

Figure 3.2. Sample observations and estimated regression line.

Note that $\sum \hat{u}_i = 0$ by virtue of the first condition we imposed. The sum of
squares of the residuals is

$$\sum \hat{u}_i^2 = (0.8)^2 + (0.6)^2 + (-2.6)^2 + (0.2)^2 + (1.0)^2 = 8.8$$

The method of least squares described in the next section is based on the
principle of choosing $\hat{\alpha}$ and $\hat{\beta}$ so that $\sum \hat{u}_i^2$ is a minimum. That is, the sum of
squares of prediction errors is minimized. With this method we get the same
estimates of $\alpha$ and $\beta$ as we have obtained here because we get the same normal
equations.
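The properties of the residuals noted above can be verified directly. The following sketch, with a helper function `rss` of our own naming, computes the residuals for the sportswear data, checks the two sample moment conditions, and compares the residual sum of squares at the estimates (1.0, 1.2) with that of one nearby line. The comparison line (1.0, 1.3) is an arbitrary choice made only for illustration.

```python
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 6, 8]
alpha_hat, beta_hat = 1.0, 1.2  # estimates obtained in the text


def rss(a, b):
    """Sum of squared within-sample prediction errors for the line y = a + b*x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


residuals = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
sum_resid = sum(residuals)                                   # first moment condition
sum_x_resid = sum(xi * ui for xi, ui in zip(x, residuals))   # second moment condition

print("residuals:", [round(ui, 1) for ui in residuals])
print("sum of residuals:", round(sum_resid, 10))
print("sum of x * residual:", round(sum_x_resid, 10))
print("RSS at (1.0, 1.2):", round(rss(alpha_hat, beta_hat), 4))
print("RSS at (1.0, 1.3):", round(rss(1.0, 1.3), 4))
```

Moving the slope away from 1.2 raises the sum of squared residuals, which is the sense in which the estimated line is "least squares."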

3.4 Method of Least Squares 


The method of least squares is the automobile of modern statistical analysis; 
despite its limitations, occasional accidents, and incidental pollution, it and 
its numerous variations, extensions and related conveyances carry the bulk 
of statistical analysis, and are known and valued by all. 

Stephen M. Stigler¹

¹S. M. Stigler, “Gauss and the Invention of Least Squares,” The Annals of Statistics, Vol. 9,
No. 3, 1981, pp. 465-474.

The method of least squares requires that we should choose $\hat{\alpha}$ and $\hat{\beta}$ as estimates
of $\alpha$ and $\beta$, respectively, so that

$$Q = \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 \tag{3.5}$$

is a minimum. Q is also the sum of squares of the (within-sample) prediction
errors when we predict $y_i$ given $x_i$ and the estimated regression equation. We
will show in the appendix to this chapter that the least squares estimators have
desirable optimal properties. This property is often abbreviated as BLUE (best
linear unbiased estimators). We are relegating the proof to the appendix so that
readers not interested in proofs can proceed.

The intuitive idea behind the least squares procedure can be described figuratively
with reference to Figure 3.2, which gives a graph of the points $(x_i, y_i)$.
We pass the regression line through the points in such a way that it is “as close
as possible” to the points. The question is what is meant by “close.” The procedure
of minimizing Q in (3.5) implies that we minimize the sum of squares of
vertical distances of the points from the line. Some alternative methods of measuring
closeness are illustrated in Figure 3.4.

With the readily available computer programs, readers interested in just obtaining
results need not even know how all the estimators are derived. However,
it is advisable to know a little bit about the derivation of the least squares
estimators. Readers not interested in the algebraic detail can go to the illustrative
examples.

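To minimize Q in equation (3.5) with respect to $\hat{\alpha}$ and $\hat{\beta}$, we equate its first
derivatives with respect to $\hat{\alpha}$ and $\hat{\beta}$ to zero. This procedure yields (in the following
equations, $\sum$ denotes $\sum_{i=1}^{n}$)

$$\frac{\partial Q}{\partial \hat{\alpha}} = 0 \;\Rightarrow\; \sum 2(y_i - \hat{\alpha} - \hat{\beta} x_i)(-1) = 0$$

or

$$\sum y_i = n\hat{\alpha} + \hat{\beta} \sum x_i$$

or

$$\bar{y} = \hat{\alpha} + \hat{\beta}\bar{x} \tag{3.6}$$

and

$$\frac{\partial Q}{\partial \hat{\beta}} = 0 \;\Rightarrow\; \sum 2(y_i - \hat{\alpha} - \hat{\beta} x_i)(-x_i) = 0$$

or

$$\sum y_i x_i = \hat{\alpha} \sum x_i + \hat{\beta} \sum x_i^2 \tag{3.7}$$

Equations (3.6) and (3.7) are called the normal equations. Substituting the
value of $\hat{\alpha}$ from (3.6) into (3.7), we get

$$\sum y_i x_i = \sum x_i(\bar{y} - \hat{\beta}\bar{x}) + \hat{\beta} \sum x_i^2 = n\bar{x}(\bar{y} - \hat{\beta}\bar{x}) + \hat{\beta} \sum x_i^2 \tag{3.8}$$

Let us define

$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$$

and

$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$$

Then (3.8) can be written as

$$\hat{\beta} S_{xx} = S_{xy} \quad \text{or} \quad \hat{\beta} = \frac{S_{xy}}{S_{xx}} \tag{3.9}$$

Hence the least squares estimators for $\alpha$ and $\beta$ are

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \tag{3.10}$$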

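The estimated residuals are

$$\hat{u}_i = y_i - \hat{\alpha} - \hat{\beta} x_i$$

The two normal equations show that these residuals satisfy the equations

$$\sum \hat{u}_i = 0 \quad \text{and} \quad \sum x_i \hat{u}_i = 0$$

The residual sum of squares (to be denoted by RSS) is given by

$$\begin{aligned}
\mathrm{RSS} &= \sum (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 \\
&= \sum [y_i - \bar{y} - \hat{\beta}(x_i - \bar{x})]^2 \\
&= \sum (y_i - \bar{y})^2 + \hat{\beta}^2 \sum (x_i - \bar{x})^2 - 2\hat{\beta} \sum (y_i - \bar{y})(x_i - \bar{x}) \\
&= S_{yy} + \hat{\beta}^2 S_{xx} - 2\hat{\beta} S_{xy}
\end{aligned}$$

But $\hat{\beta} = S_{xy}/S_{xx}$. Hence we have

$$\mathrm{RSS} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = S_{yy} - \hat{\beta} S_{xy}$$

$S_{yy}$ is usually denoted by TSS (total sum of squares) and $\hat{\beta} S_{xy}$ is usually denoted
by ESS (explained sum of squares). Thus

    TSS     =     ESS       +     RSS
  (total)     (explained)     (residual)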

Some other authors like to use RSS to denote regression sum of squares and
ESS as error sum of squares. The confusing thing here is that both the words
“explained” and “error” start with the letter “e” and the words “regression”
and “residual” start with the letter “r.” However, we prefer to use RSS for
residual sum of squares and ESS for explained sum of squares.² We will reserve
the word residual to denote $\hat{u} = y - \hat{\alpha} - \hat{\beta} x$ and the word error to denote the
disturbance u in equation (3.3). Thus the residual is the estimated error.

The proportion of the total sum of squares explained is denoted by $r_{xy}^2$, where
$r_{xy}$ is called the correlation coefficient. Thus $r_{xy}^2 = \mathrm{ESS}/\mathrm{TSS}$ and $1 - r_{xy}^2 =
\mathrm{RSS}/\mathrm{TSS}$. If $r_{xy}^2$ is high (close to 1), then x is a good “explanatory” variable for
y. The term $r_{xy}^2$ is called the coefficient of determination and must fall between
zero and 1 for any given regression. If $r_{xy}^2$ is close to zero, the variable x explains
very little of the variation in y. If $r_{xy}^2$ is close to 1, the variable x explains most
of the variation in y.

The coefficient of determination $r_{xy}^2$ is given by

$$r_{xy}^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = \frac{\hat{\beta} S_{xy}}{S_{yy}}$$


Summary 

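The estimates for the regression coefficients are

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

The residual sum of squares is given by

$$\mathrm{RSS} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = S_{yy}(1 - r_{xy}^2)$$

and the coefficient of determination is given by

$$r_{xy}^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{\hat{\beta} S_{xy}}{S_{yy}}$$

The least squares estimators $\hat{\beta}$ and $\hat{\alpha}$ yield an estimated straight line that has a
smaller RSS than any other straight line.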
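The summary formulas translate directly into a few lines of code. The function below is a sketch under our own naming (`simple_regression` and the dictionary keys are not from the text); it computes the sums of squares about the means and then the estimates, the decomposition of TSS into ESS and RSS, and the coefficient of determination. It is applied here to the sportswear data of Section 3.3.

```python
def simple_regression(x, y):
    """Least squares regression of y on x, computed from S_xx, S_xy, and S_yy."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    ess = beta_hat * s_xy   # explained sum of squares
    rss = s_yy - ess        # residual sum of squares
    return {
        "alpha": alpha_hat,
        "beta": beta_hat,
        "tss": s_yy,
        "ess": ess,
        "rss": rss,
        "r2": ess / s_yy,
    }


fit = simple_regression([1, 2, 3, 4, 5], [3, 4, 2, 6, 8])
for name, value in fit.items():
    print(f"{name}: {value:.4f}")
```

For these data the function returns the same intercept, slope, and residual sum of squares as the method of moments calculation, as it must, since both methods lead to the same normal equations.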


The Reverse Regression 

We have until now considered the regression of y on x. This is called the direct
regression. Sometimes one has to consider the regression of x on y as well.
This is called the reverse regression. The reverse regression has been advocated
in the analysis of sex (or race) discrimination in salaries. For instance, if

y = salary
x = qualifications

and we are interested in determining if there is sex discrimination in salaries,
we can ask:

1. Whether men and women with the same qualifications (value of x) are
getting the same salaries (value of y). This question is answered by the
direct regression, regression of y on x. Alternatively, we can ask:

2. Whether men and women with the same salaries (value of y) have the
same qualifications (value of x). This question is answered by the reverse
regression, regression of x on y.³

²This is also the notation used in J. Johnston, Econometric Methods, 3rd ed. (New York:
McGraw-Hill, 1984).

In this example both the questions make sense and hence we have to look at
both these regressions. For the reverse regression the regression equation can
be written as

$$x_i = \alpha' + \beta' y_i + v_i$$

where the $v_i$ are the errors that satisfy assumptions similar to those stated earlier
in Section 3.2 for $u_i$. Interchanging x and y in the formulas that we derived, we
get

$$\hat{\beta}' = \frac{S_{xy}}{S_{yy}} \quad \text{and} \quad \hat{\alpha}' = \bar{x} - \hat{\beta}'\bar{y}$$

Denoting the residual sum of squares in this case by RSS', we have

$$\mathrm{RSS}' = S_{xx} - \frac{S_{xy}^2}{S_{yy}}$$

Note that

$$\hat{\beta}\hat{\beta}' = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r_{xy}^2$$

Hence if $r_{xy}^2$ is close to 1, the two regression lines will be close to each other.
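The relation between the two slopes can be checked numerically. The sketch below reuses the sportswear data of Section 3.3 purely as a convenient data set (the variable names are ours). It also shows why a squared correlation near 1 brings the lines together: when the reverse regression line is drawn in the same (x, y) plane as the direct one, its slope is the reciprocal of the reverse slope, so the ratio of the two slopes in that plane equals the squared correlation.

```python
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 6, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta_direct = s_xy / s_xx     # slope in the regression of y on x
beta_reverse = s_xy / s_yy    # slope in the regression of x on y
r_squared = s_xy ** 2 / (s_xx * s_yy)

# Drawn in the (x, y) plane, the reverse line x = a' + b'y has slope 1 / b'.
slope_reverse_in_xy_plane = 1 / beta_reverse

print("direct slope:", round(beta_direct, 4))
print("reverse slope:", round(beta_reverse, 4))
print("product of slopes:", round(beta_direct * beta_reverse, 4))
print("r squared:", round(r_squared, 4))
print("ratio of slopes in the (x, y) plane:",
      round(beta_direct / slope_reverse_in_xy_plane, 4))
```

Both lines pass through the point of means, so they coincide only when the two slopes in the (x, y) plane are equal, that is, when the squared correlation is exactly 1.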

We now illustrate with an example. 

Illustrative Example 

Consider the data in Table 3.2. The data are for 10 workers, with

x = labor-hours of work
y = output

We wish to determine the relationship between output and labor-hours of work.
We have

$$\bar{x} = \frac{80}{10} = 8 \qquad \bar{y} = \frac{96}{10} = 9.6$$

³We discuss this problem further in Chapter 11.

Table 3.2

Observation      x      y     $x^2$    $y^2$     xy
     1          10     11      100      121     110
     2           7     10       49      100      70
     3          10     12      100      144     120
     4           5      6       25       36      30
     5           8     10       64      100      80
     6           8      7       64       49      56
     7           6      9       36       81      54
     8           7     10       49      100      70
     9           9     11       81      121      99
    10          10     10      100      100     100
   Total        80     96      668      952     789

$$S_{xx} = 668 - 10(8)^2 = 668 - 640 = 28$$
$$S_{xy} = 789 - 10(8)(9.6) = 789 - 768 = 21$$
$$S_{yy} = 952 - 10(9.6)^2 = 952 - 921.6 = 30.4$$
$$r_{xy} = \frac{21}{\sqrt{28(30.4)}} = \frac{21}{29.18} = 0.72 \quad \text{or} \quad r_{xy}^2 = 0.52$$
$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \frac{21}{28} = 0.75$$
$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 9.6 - 0.75(8.0) = 3.6$$


Hence the regression of y on x is

$$\hat{y} = 3.6 + 0.75x$$

Since x is labor-hours of work and y is output, the slope coefficient 0.75 measures
the marginal productivity of labor. As for the intercept 3.6, it means that
output will be 3.6 when labor-hours of work is zero! Clearly, this does not make
sense. However, this merely illustrates the point we made earlier, that we
should not try to get predicted values of y too far out of the range of sample
values. Here x ranges from 5 to 10.
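The calculations for Table 3.2 can be reproduced from the raw data. The sketch below follows the same route as the text (raw sums, then sums of squares about the means); the variable names are ours. It also lists fitted values only for labor-hours between 5 and 10, in keeping with the warning against predicting far outside the sample range.

```python
x = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]      # labor-hours of work
y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]  # output
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross products about the means, from the raw sums.
s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
s_yy = sum(yi * yi for yi in y) - n * y_bar ** 2

beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar
r_xy = s_xy / (s_xx * s_yy) ** 0.5

print("S_xx, S_xy, S_yy:", round(s_xx, 2), round(s_xy, 2), round(s_yy, 2))
print("alpha_hat, beta_hat:", round(alpha_hat, 2), round(beta_hat, 2))
print("r_xy, r_xy squared:", round(r_xy, 3), round(r_xy ** 2, 3))

# Fitted output within the observed range of labor-hours only.
for hours in range(min(x), max(x) + 1):
    print(hours, round(alpha_hat + beta_hat * hours, 2))
```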

As for the reverse regression we have

$$\hat{\beta}' = \frac{S_{xy}}{S_{yy}} = \frac{21}{30.4} = 0.69$$

and

$$\hat{\alpha}' = \bar{x} - \hat{\beta}'\bar{y} = 8.0 - 9.6(0.69) = 1.37$$

Hence the regression of x on y is

$$\hat{x} = 1.37 + 0.69y$$

Note that the product of the two slopes is $\hat{\beta}\hat{\beta}' = 0.75(0.69) = 0.52 = r_{xy}^2$.
These two regression lines are presented in Figure 3.3. The procedure used in
the two regressions is illustrated in Figure 3.4. If we consider a scatter diagram
of the observations, the procedure of minimizing

$$\sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2$$

amounts to passing a line through the observations so as to minimize the
sum of squares of the vertical distances of the points in the scatter diagram
from the line. This is shown in Figure 3.4(a). The line shows the regression of
y on x.

On the other hand, the procedure of minimizing

$$\sum_{i=1}^{n} (x_i - \hat{\alpha}' - \hat{\beta}' y_i)^2$$

amounts to passing a line through the observations so as to minimize the
sum of squares of the horizontal distances of the points in the scatter diagram
from the line. This is shown in Figure 3.4(b). The line shows the regression of
x on y.

We can also think of passing a line in such a way that we minimize the sum
of squares of the perpendicular distances of the points from the line. This is
called orthogonal regression.

Figure 3.3. Regression lines for regression of y on x and x on y.

Figure 3.4. Minimization of residual sum of squares in the regression of y on x and x on y.
(a) Regression of y on x. (b) Regression of x on y.

Since a discussion of orthogonal regression is beyond the scope of this book,
we will confine our discussion to only the other two regressions: regression of
y on x and of x on y. A question arises as to which of these is the appropriate
one. Following are some general guidelines on this problem.

1. If the model is such that the direction of causation is known (e.g., if we
say that advertising expenditures at time t influence sales at time t but
not the other way around), we should estimate the regression using sales
as the explained variable and advertising expenditures as the explanatory
variable. The opposite regression does not make sense. We should estimate
this equation whether our objective is to estimate sales for given
advertising expenditures or to estimate advertising expenditures for given
sales (i.e., always estimate a regression of an effect variable on a causal
variable).

2. In problems where the direction of causation is not as clear cut, and
where y and x have a joint normal distribution, both the regressions y on
x and x on y make sense and one should estimate a regression of y on x
to predict y given x and a regression of x on y to predict x given y.

3. In models where y and x are both measured with error, we have to estimate
both the regressions of y on x and x on y to get “bounds” on $\beta$. This
problem is discussed in Chapter 11.

4. The case of salary discrimination mentioned earlier is an example where 
the problem can be posed in two different and equally meaningful ways. 

In such cases both regressions make sense. 

5. Sometimes, which regression makes sense depends on how the data are
generated. Consider the data presented in Table 3.2. x is labor-hours of
work and y is output, and the observations are for different workers.
Which of the two regressions makes sense depends on the way the data
were generated. If the workers are given some hours of work (x), and the
output they produced (y) was observed, then a regression of y on x is the
correct one to look at. In this case x is the controlled variable. On the
other hand, if the workers were assigned some amount of output (y) and
the hours of work (x) that they took to produce that output were observed,
it is a regression of x on y that is meaningful. In this case y is the controlled
variable. Here what we are considering is an experiment where
one of the variables is controlled or fixed. With experimental data which
of the variables should be treated as dependent and which as independent
will be clear from the way the experiment was conducted. With most
economic data, however, this is not a clear-cut choice.


3.5 Statistical Inference in the 
Linear Regression Model 

In Section 3.4 we discussed procedures for obtaining the least squares estimators.
To obtain the least squares estimators of $\alpha$ and $\beta$ we do not need to
assume any particular probability distribution for the errors $u_i$. But to get interval
estimates for the parameters and to test any hypotheses about them, we
need to assume that the errors $u_i$ have a normal distribution. As proved in the
appendix to this chapter, the least squares estimators

1. Are unbiased.

2. Have minimum variance among the class of all linear unbiased estimators.

These properties hold even if the errors $u_i$ do not have the normal distribution
provided that the other assumptions we have made are satisfied. These assumptions,
we may recall, are:

1. $E(u_i) = 0$.

2. $V(u_i) = \sigma^2$ for all i.

3. $u_i$ and $u_j$ are independent for $i \neq j$.

4. The $x_j$ are nonstochastic.

We will now make the additional assumption that the errors $u_i$ are normally
distributed and show how we can get confidence intervals for $\alpha$ and $\beta$ and test
any hypotheses about $\alpha$ and $\beta$. We will relegate the proofs of the propositions
to the appendix and state the results here.

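First, we need the sampling distributions of $\hat{\alpha}$ and $\hat{\beta}$. We can prove that they
have a normal distribution (proofs are given in the appendix). Specifically, we
have the following results:

$$\hat{\alpha} \sim N\left[\alpha,\; \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)\right] \qquad \hat{\beta} \sim N\left(\beta,\; \frac{\sigma^2}{S_{xx}}\right)$$

These results would be useful if the error variance $\sigma^2$ were known. Unfortunately,
in practice, $\sigma^2$ is not known, and has to be estimated.

If RSS is the residual sum of squares, then

$$\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - 2} \quad \text{is an unbiased estimator for } \sigma^2$$

Also, $\mathrm{RSS}/\sigma^2$ has a $\chi^2$-distribution with degrees of freedom (n - 2).
Further the distribution of RSS is independent of the distributions of $\hat{\alpha}$ and $\hat{\beta}$.
(Proofs of these propositions are relegated to the appendix.)

This result can be used to get confidence intervals for $\sigma^2$ or to test hypotheses
about $\sigma^2$. However, we are still left with the problem of making inferences
about $\alpha$ and $\beta$. For this purpose we use the t-distribution.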

We know that if we have two variables $x_1 \sim N(0, 1)$ and $x_2 \sim \chi^2$ with degrees
of freedom k, and $x_1$ and $x_2$ are independent, then

$$x = \frac{x_1}{\sqrt{x_2/k}} = \frac{\text{standard normal}}{\sqrt{\text{independent averaged } \chi^2}}$$

has a t-distribution with d.f. k.

In this case $(\hat{\beta} - \beta)/\sqrt{\sigma^2/S_{xx}} \sim N(0, 1)$. (We have subtracted the mean and
divided by the standard deviation.) Also, $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-2}$ and the two distributions
are independent. Hence the ratio

$$\frac{\hat{\beta} - \beta}{\sqrt{\sigma^2/S_{xx}}} \Bigg/ \sqrt{\frac{\mathrm{RSS}}{(n - 2)\sigma^2}}$$

has a t-distribution with d.f. (n - 2). Now $\sigma^2$ cancels out and writing RSS/(n - 2)
as $\hat{\sigma}^2$ we get the result that $(\hat{\beta} - \beta)/\sqrt{\hat{\sigma}^2/S_{xx}}$ has a t-distribution with
d.f. (n - 2). Note that the variance of $\hat{\beta}$ is $\sigma^2/S_{xx}$. Since $\sigma^2$ is not known, we
use an unbiased estimator RSS/(n - 2). Thus $\hat{\sigma}^2/S_{xx}$ is the estimated variance
of $\hat{\beta}$ and its square root is called the standard error denoted by $\mathrm{SE}(\hat{\beta})$.

We can follow a similar procedure for $\hat{\alpha}$. We substitute $\hat{\sigma}^2$ for $\sigma^2$ in the variance
of $\hat{\alpha}$ and take the square root to get the standard error $\mathrm{SE}(\hat{\alpha})$.

Thus we have the following result:

$(\hat{\alpha} - \alpha)/\mathrm{SE}(\hat{\alpha})$ and $(\hat{\beta} - \beta)/\mathrm{SE}(\hat{\beta})$ each have a t-distribution with
d.f. (n - 2). These distributions can be used to get confidence intervals
for $\alpha$ and $\beta$ and to test hypotheses about $\alpha$ and $\beta$.

$\hat{\sigma}$ is usually called the standard error of the regression. It is denoted by SER
(sometimes by SEE).

Illustrative Example 

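As an illustration of the calculation of standard errors, consider the data in
Table 3.2. Earlier we obtained the regression equation of y on x as

$$\hat{y} = 3.6 + 0.75x$$

We will now calculate the standard errors of $\hat{\alpha}$ and $\hat{\beta}$. These are obtained by:

1. Computing the variances of $\hat{\alpha}$ and $\hat{\beta}$ in terms of $\sigma^2$.

2. Substituting $\hat{\sigma}^2 = \mathrm{RSS}/(n - 2)$ for $\sigma^2$.

3. Taking the square root of the resulting expressions.

We have

$$V(\hat{\alpha}) = \sigma^2\left(\frac{1}{10} + \frac{64}{28}\right) = 2.39\sigma^2$$
$$V(\hat{\beta}) = \frac{\sigma^2}{28} = 0.036\sigma^2$$
$$\hat{\sigma}^2 = \frac{\mathrm{RSS}}{n - 2} = \frac{1}{8}\left(30.4 - \frac{(21)^2}{28}\right) = \frac{14.65}{8} = 1.83$$
$$\mathrm{SE}(\hat{\alpha}) = \sqrt{(2.39)(1.83)} = 2.09$$
$$\mathrm{SE}(\hat{\beta}) = \sqrt{(0.036)(1.83)} = 0.256$$

The standard errors are usually presented in parentheses under the estimates
of the regression coefficients.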
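The three steps can be traced in code using only the summary statistics for Table 3.2. The sketch below assumes the standard variance expressions for this model, namely that the variance of the intercept estimator is the error variance times (1/n plus the squared mean of x over S_xx) and that the variance of the slope estimator is the error variance over S_xx; these reproduce the factors 2.39 and 0.036 used above. The variable names are ours.

```python
import math

# Summary statistics for Table 3.2, as computed in Section 3.4.
n = 10
x_bar = 8.0
s_xx, s_xy, s_yy = 28.0, 21.0, 30.4

# Step 1: variances of the estimators as multiples of the error variance.
var_alpha_factor = 1 / n + x_bar ** 2 / s_xx   # about 2.39
var_beta_factor = 1 / s_xx                     # about 0.036

# Step 2: estimate the error variance by RSS / (n - 2).
rss = s_yy - s_xy ** 2 / s_xx
sigma2_hat = rss / (n - 2)

# Step 3: take square roots to obtain the standard errors.
se_alpha = math.sqrt(var_alpha_factor * sigma2_hat)
se_beta = math.sqrt(var_beta_factor * sigma2_hat)

print("RSS:", round(rss, 2))
print("estimated error variance:", round(sigma2_hat, 2))
print("SE(alpha_hat):", round(se_alpha, 2))
print("SE(beta_hat):", round(se_beta, 3))
```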

Confidence Intervals for $\alpha$, $\beta$, and $\sigma^2$

Since $(\hat{\alpha} - \alpha)/\mathrm{SE}(\hat{\alpha})$ and $(\hat{\beta} - \beta)/\mathrm{SE}(\hat{\beta})$ have t-distributions with d.f. (n - 2),
using the table of the t-distribution (in the appendix) with n - 2 = 8 d.f. we
get

$$\mathrm{Prob}\left[-2.306 < \frac{\hat{\alpha} - \alpha}{\mathrm{SE}(\hat{\alpha})} < 2.306\right] = 0.95$$

and

$$\mathrm{Prob}\left[-2.306 < \frac{\hat{\beta} - \beta}{\mathrm{SE}(\hat{\beta})} < 2.306\right] = 0.95$$

These give confidence intervals for $\alpha$ and $\beta$. Substituting the values of
$\hat{\alpha}$, $\hat{\beta}$, $\mathrm{SE}(\hat{\alpha})$, and $\mathrm{SE}(\hat{\beta})$ and simplifying we get the 95% confidence limits for $\alpha$
and $\beta$ as (-1.22, 8.42) for $\alpha$ and (0.16, 1.34) for $\beta$. Note that the 95% confidence
limits for $\alpha$ are $\hat{\alpha} \pm 2.306\,\mathrm{SE}(\hat{\alpha})$ and for $\beta$ are $\hat{\beta} \pm 2.306\,\mathrm{SE}(\hat{\beta})$. Although it is
not often used, we will also illustrate the procedure of finding a confidence
interval for $\sigma^2$. Since we know that $\mathrm{RSS}/\sigma^2$ has a $\chi^2$-distribution with d.f. (n - 2),
we can use the tables of the $\chi^2$-distribution for getting any required confidence
interval. Suppose that we want a 95% confidence interval. Note that we
will write $\mathrm{RSS} = (n - 2)\hat{\sigma}^2$.

From the tables of the $\chi^2$-distribution with 8 degrees of freedom, we find that the probability of obtaining a value $< 2.18$ is 0.025 and of obtaining a value $> 17.53$ is 0.025. Hence

$$\text{Prob}\left(2.18 < \frac{8\hat\sigma^2}{\sigma^2} < 17.53\right) = 0.95$$

or

$$\text{Prob}\left(\frac{8\hat\sigma^2}{17.53} < \sigma^2 < \frac{8\hat\sigma^2}{2.18}\right) = 0.95$$

Since $\hat\sigma^2 = 1.83$, substituting this value, we get the 95% confidence limits for $\sigma^2$ as $(0.84, 6.72)$.

Note that the confidence intervals for $\alpha$ and $\beta$ are symmetric around $\hat\alpha$ and $\hat\beta$, respectively (because the $t$-distribution is a symmetric distribution). This is not the case with the confidence interval for $\sigma^2$. It is not symmetric around $\hat\sigma^2$.
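As a rough check on the arithmetic, the three intervals can be reproduced as follows. The sketch takes the critical values (2.306, 2.18, 17.53) directly from the tables cited in the text rather than computing them, and the function name `t_interval` is ours.

```python
# Estimates and standard errors quoted in the text for the Table 3.2 example.
alpha_hat, se_alpha = 3.6, 2.09
beta_hat, se_beta = 0.75, 0.256
sigma2_hat, df = 1.83, 8

# Tabulated critical values for 8 d.f. (taken from the text, not computed here).
t_975 = 2.306                     # two-sided 95% point of the t-distribution
chi2_lo, chi2_hi = 2.18, 17.53    # 2.5% and 97.5% points of the chi-square distribution


def t_interval(estimate, se, t_crit):
    """Symmetric interval: estimate +/- t_crit * SE."""
    return (estimate - t_crit * se, estimate + t_crit * se)


ci_alpha = t_interval(alpha_hat, se_alpha, t_975)
ci_beta = t_interval(beta_hat, se_beta, t_975)

# RSS / sigma^2 ~ chi-square(df) with RSS = df * sigma2_hat; invert the inequality.
ci_sigma2 = (df * sigma2_hat / chi2_hi, df * sigma2_hat / chi2_lo)

print("alpha:", ci_alpha)
print("beta:", ci_beta)
print("sigma^2:", ci_sigma2)
```

The interval for $\sigma^2$ is visibly asymmetric around 1.83, which is the point made in the text.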

The confidence intervals we have obtained for $\alpha$, $\beta$, and $\sigma^2$ are all very wide. We can produce narrower intervals by reducing the confidence coefficient. For instance, the 80% confidence limits for $\beta$ are

$$\hat\beta \pm 1.397\,SE(\hat\beta) \quad \text{since} \quad \text{Prob}(-1.397 < t < 1.397) = 0.80$$

from the $t$-tables with 8 d.f.

We therefore get the 80% confidence limits for $\beta$ as

$$0.75 \pm 1.397(0.256) = (0.39, 1.11)$$

The confidence intervals we have constructed are two-sided intervals. Sometimes we want upper or lower limits for $\beta$, in which case we construct one-sided intervals. For instance, from the $t$-tables with 8 d.f. we have

$$\text{Prob}(t < 1.86) = 0.95$$

and

$$\text{Prob}(t > -1.86) = 0.95$$

Hence for a one-sided confidence interval, the upper limit for $\beta$ is

$$\hat\beta + 1.86\,SE(\hat\beta) = 0.75 + 1.86(0.256) = 0.75 + 0.48 = 1.23$$

The 95% confidence interval is $(-\infty, 1.23)$. Similarly, the lower limit for $\beta$ is

$$\hat\beta - 1.86\,SE(\hat\beta) = 0.75 - 0.48 = 0.27$$

Thus the 95% confidence interval is $(0.27, +\infty)$.

We will give further examples of the use of one-sided intervals after we discuss tests of hypotheses.


Testing of Hypotheses 

Turning to the problem of testing of hypotheses, suppose we want to test the hypothesis that the true value of $\beta$ is 1.0. We know that

$$t_0 = \frac{\hat\beta - \beta}{SE(\hat\beta)} \quad \text{has a $t$-distribution with $(n - 2)$ degrees of freedom}$$

Let $t_0$ be the observed $t$-value. If the alternative hypothesis is $\beta \neq 1$, then we consider $|t_0|$ as the test statistic. Hence if the true value of $\beta$ is 1.0, we have

$$t_0 = \frac{0.75 - 1.0}{0.256} = -0.98 \qquad \text{Hence } |t_0| = 0.98$$

Looking at the $t$-tables for 8 degrees of freedom, we see that

$$\text{Prob}(t > 0.706) = 0.25$$

and

$$\text{Prob}(t > 1.397) = 0.10$$

Thus $\text{Prob}(t > 0.98)$ is roughly 0.19 (by linear interpolation) or $\text{Prob}(|t| > 0.98) \approx 0.38$. This probability is not very low and we do not reject the hypothesis that $\beta = 1.0$. It is customary to use 0.05 as a low probability and to reject the suggested hypothesis if the probability of obtaining as extreme a $t$-value as the observed $t_0$ is less than 0.05. In this case either the suggested hypothesis is not true or it is true but an improbable event has occurred.
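The test just described, including the linear interpolation between the two tabulated points, can be sketched as follows. The helper name `interpolate_tail_prob` is ours, and the interpolation is only the rough approximation used in the text, not an exact tail probability.

```python
def interpolate_tail_prob(t, t_lo, p_lo, t_hi, p_hi):
    """Linear interpolation of Prob(T > t) between two tabulated points."""
    return p_lo + (t - t_lo) * (p_hi - p_lo) / (t_hi - t_lo)


beta_hat, se_beta = 0.75, 0.256
beta_null = 1.0                       # hypothesized value of beta

t0 = (beta_hat - beta_null) / se_beta

# Tabulated upper-tail points of the t-distribution with 8 d.f. (from the text).
one_tail = interpolate_tail_prob(abs(t0), 0.706, 0.25, 1.397, 0.10)
two_tail = 2 * one_tail

reject_at_5_percent = two_tail < 0.05

print(f"t0 = {t0:.2f}, Prob(|t| > |t0|) is roughly {two_tail:.2f}")
print("reject H0: beta = 1.0 at the 5% level?", reject_at_5_percent)
```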

Note that for 8 degrees of freedom, the 5% probability points are $\pm 2.306$ for a two-sided test and $\pm 1.86$ for a one-sided test. Thus if both high and low $t$-values are to be considered as evidence against the suggested hypothesis, we reject it if the observed $t_0$ is $> 2.306$ or $< -2.306$. On the other hand, if only very high or very low $t$-values are to be considered as evidence against the suggested hypothesis, we reject it if $t_0 > 1.86$ or $t_0 < -1.86$, depending on the suggested direction of deviation.

Although it is customary to use the 5% probability level for rejection of the suggested hypothesis, there is nothing sacred about this number. The theory of significance tests with the commonly used significance levels of 0.05 and 0.01 owes its origins to the famous British statistician Sir R. A. Fisher (1890-1962). He is considered the father of modern statistical methods and the numbers 0.05 and 0.01 suggested by him have been adopted universally.

Another point to note is that the hypothesis being tested (in this case $\beta = 1$) is called the null hypothesis. Again the terminology is misleading and owes its origin to the fact that the initial hypotheses tested were that some parameters were zero. Thus a hypothesis $\beta = 0$ can be called a null hypothesis but not a hypothesis $\beta = 1$. In any case for the present we will stick to the standard terminology and call the hypothesis tested the null hypothesis and use the standard significance levels of 0.05 and 0.01.

Finally, it should be noted that there is a correspondence between the confidence intervals derived earlier and tests of hypotheses. For instance, the 95% confidence interval we derived earlier for $\beta$ is $(0.16 < \beta < 1.34)$. Any hypothesis that says $\beta = \beta_0$, where $\beta_0$ is in this interval, will not be rejected at the 5% level for a two-sided test. For instance, the hypothesis $\beta = 1.27$ will not be rejected, but the hypothesis $\beta = 1.35$ or $\beta = 0.10$ will be. For one-sided tests we consider one-sided confidence intervals.

It is also customary to term some regression coefficients as “significant” or “not significant” depending on the $t$-ratios, and attach asterisks to them if they are significant. This procedure should be avoided. For instance, in our illustrative example the regression equation is sometimes presented as

$$\hat{y} = \underset{(2.09)}{3.6} + \underset{(0.256)}{0.75^{*}}\,x$$

The * on the slope coefficient indicates that it is “significant” at the 5% level. However, this statement means that “it is significantly different from ‘zero’ ” and this statement is meaningful only if the hypothesis being tested is $\beta = 0$. Such a hypothesis would not be meaningful in many cases. For instance, in our example, if $y$ is output and $x$ is labor-hours of work, a hypothesis $\beta = 0$ does not make any sense. Similarly, if $y$ is a posttraining score and $x$ is a pretraining score, a hypothesis $\beta = 0$ would imply that the pretraining score has no effect on the posttraining score and no one would be interested in testing such an extreme hypothesis.

Example of Comparing Test Scores from the GRE 
and GMAT Tests 4 

In the College of Business Administration at the University of Florida, two different tests are used to measure aptitude for graduate work. The economics department relies on GRE scores and the other graduate departments rely on GMAT scores. To allocate graduate assistantships and fellowships, it is important to be able to compare the scores of these two tests. A rough rule of thumb that was suggested was the following:

GRE score = 2(GMAT score) + 100

A question arose as to the adequacy of this rule.

To answer this question data were obtained on 262 current and recent University of Florida students who had taken both tests. The GRE and GMAT scores were highly correlated. The correlation coefficients were 0.71 for U.S. students and 0.80 for foreign students.

4 I would like to thank my colleague Larry Kenny for this example and the computations.

If we have GRE scores on some students and GMAT scores on some others, we can convert all scores to GRE scores, in which case we use the regression of GRE score on GMAT score. Alternatively, we can convert all scores to GMAT scores, in which case we use the regression of GMAT score on GRE score. These two regressions were as follows (figures in parentheses are standard errors of the slope coefficients):

Students   Number   Regression                              R^2

Regression of GRE on GMAT
All         255     GRE = 333 + 1.470 GMAT    (0.087)       0.53
U.S.        211     GRE = 336 + 1.458 GMAT    (0.101)       0.50
Foreign      44     GRE = 284 + 1.606 GMAT    (0.183)       0.64

Regression of GMAT on GRE
All         255     GMAT = 125 + 0.363 GRE    (0.021)       0.53
U.S.        211     GMAT = 151 + 0.345 GRE    (0.024)       0.50
Foreign      44     GMAT =  60 + 0.400 GRE    (0.046)       0.64


Although we can get predictions for the U.S. students and foreign students separately, and for the GRE given GMAT and GMAT given GRE, we shall present only one set of predictions and compare this with the present rule. The predictions from the regression of GRE on GMAT (for all students) are as follows:

GMAT     GRE     Current Rule(a)
 400      921         900
 450      995        1000
 500     1068        1100
 550     1142        1200
 600     1215        1300
 650     1289        1400
 700     1362        1500

(a) GRE = 2(GMAT) + 100.

Thus, except for scores at the low end (which are not particularly relevant because students with low scores are not admitted), the current rule overpredicts the GRE based on the GMAT score. Thus, the current conversion biases decisions in favor of those taking the GMAT test as compared to those taking the GRE test. Since the students entering the economics department take the GRE test, the current rule results in decisions unfavorable to the economics department.
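The comparison between the fitted regression and the rule of thumb is easy to tabulate. The sketch below uses the all-students regression of GRE on GMAT reported above; the function names and the break-even calculation are ours and are meant only as an illustration.

```python
def gre_from_regression(gmat):
    """Predicted GRE from the all-students regression of GRE on GMAT."""
    return 333 + 1.470 * gmat


def gre_from_rule(gmat):
    """Rule of thumb under review: GRE = 2(GMAT) + 100."""
    return 2 * gmat + 100


rows = [(g, gre_from_regression(g), gre_from_rule(g)) for g in range(400, 701, 50)]

print(" GMAT  Regression   Rule   Rule - Regression")
for gmat, reg, rule in rows:
    print(f"{gmat:5d}  {reg:10.1f}  {rule:5d}  {rule - reg:+10.1f}")

# GMAT score at which the two conversions agree: 333 + 1.470 g = 100 + 2 g.
break_even = (333 - 100) / (2 - 1.470)
print(f"The rule overpredicts for GMAT scores above about {break_even:.0f}")
```

The gap between the rule and the regression grows with the GMAT score, which is why the bias matters most for the high-scoring applicants who are actually admitted.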


Regression with No Constant Term 

Sometimes the regression equation is estimated with the constant term excluded. This is called regression through the origin. This arises either because economic theory suggests an equation with no constant term or we end up with a regression equation with no constant term because of some transformations in the variables. (These are discussed in Section 5.4 of Chapter 5.)

In this case the normal equations and other formulas will be the same as before, except that there will be no “mean corrections.” That is, $S_{xx} = \sum x_i^2$, $S_{xy} = \sum x_i y_i$, $S_{yy} = \sum y_i^2$, etc. Consider the data in Table 3.2. We have

$$S_{xx} = 668, \quad S_{xy} = 789, \quad \text{and} \quad S_{yy} = 952$$

$$\hat\beta = S_{xy}/S_{xx} = 789/668 = 1.181$$

$$\hat\sigma^2 = \frac{1}{9}\left[952 - \frac{(789)^2}{668}\right] = 2.23$$

$$V(\hat\beta) = \hat\sigma^2/S_{xx} = 2.23/668 = 0.00334$$

$$SE(\hat\beta) = 0.058$$

$$r^2 = S_{xy}^2/(S_{xx}S_{yy}) = 0.979$$

(The divisor in $\hat\sigma^2$ is $n - 1 = 9$ because only one parameter is estimated.) Thus, the regression equation is

$$\hat{y} = \underset{(0.058)}{1.181}\,x \qquad r^2 = 0.98$$

Compared with the regression with the constant term, the results look better! The $t$-ratio for $\hat\beta$ has increased from 3 to 20. The $r^2$ has increased dramatically from 0.52 to 0.98. Students often come to me with high $r^2$'s (sometimes 0.9999) when they fit a regression with no constant term and claim that they got better fits. But this is spurious; one has to look at which equation predicts better.

Some computer regression programs allow the option of not including a constant term but do not give the correct $r^2$. Note that $1 - r^2 = (\text{Residual Sum of Squares})/S_{yy}$. If the residual sum of squares (RSS) is calculated from a regression with no constant term, but $S_{yy}$ is calculated with the mean correction, then we can even get a negative $r^2$. Even if we do not get a negative $r^2$, we can get a very low $r^2$. For instance, in the example with data from Table 3.2, we have RSS = 20.08 from the regression with no constant term and $S_{yy}$ (with mean correction) = 30.4. Hence, $r^2 = 0.34$ when calculated (wrongly) this way.
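The calculations for regression through the origin, and the two ways of computing $r^2$ that the text contrasts, can be sketched as follows. The raw sums are those quoted for Table 3.2; the variable names are ours.

```python
import math

n = 10
# Raw (uncorrected) sums for Table 3.2, as quoted in the text.
sum_x2, sum_xy, sum_y2 = 668.0, 789.0, 952.0
S_yy_mean_corrected = 30.4      # S_yy computed WITH the mean correction

beta_hat = sum_xy / sum_x2                    # slope through the origin
rss = sum_y2 - sum_xy ** 2 / sum_x2           # residual sum of squares
sigma2_hat = rss / (n - 1)                    # one estimated parameter, so n - 1 d.f.
se_beta = math.sqrt(sigma2_hat / sum_x2)

# r^2 consistent with the no-constant model: uncorrected S_yy in the denominator.
r2_no_constant = 1 - rss / sum_y2

# The inconsistent mixture described in the text: RSS from the no-constant
# regression, but S_yy with the mean correction.
r2_mixed = 1 - rss / S_yy_mean_corrected

print(f"beta_hat = {beta_hat:.3f}, SE = {se_beta:.3f}, t = {beta_hat / se_beta:.1f}")
print(f"r^2 (no-constant definition) = {r2_no_constant:.3f}")
print(f"r^2 (mixed, wrong)           = {r2_mixed:.2f}")
```

Neither figure is comparable with the $r^2 = 0.52$ from the regression with a constant term, which is the point of the warning above.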
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3.6 Analysis of Variance for the 
Simple Regression Model 


Yet another item that is often presented in connection with the simple linear regression model is the analysis of variance. This is the breakdown of the total sum of squares TSS into the explained sum of squares ESS and the residual sum of squares RSS. The purpose of presenting the table is to test the significance of the explained sum of squares. In this case this amounts to testing the significance of $\beta$. Table 3.3 presents the breakdown of the total sum of squares.

Under the assumptions of the model, $\text{RSS}/\sigma^2$ has a $\chi^2$-distribution with $(n - 2)$ degrees of freedom. $\text{ESS}/\sigma^2$, on the other hand, has a $\chi^2$-distribution with 1 degree of freedom only if the true $\beta$ is equal to zero. Further, these two $\chi^2$ distributions are independent. (All these results are proved in the appendix to this chapter.)

Thus under the assumption that $\beta = 0$, we have the $F$-statistic ($\sigma^2$ cancels) $F = (\text{ESS}/1)/[\text{RSS}/(n - 2)]$, which has an $F$-distribution with degrees of freedom 1 and $(n - 2)$. This $F$-statistic can be used to test the hypothesis that $\beta = 0$.

For the data in Table 3.2 we are considering, the analysis of variance is presented in Table 3.4. The $F$-statistic is $F = 15.75/1.83 = 8.6$.

Note that the $t$-statistic for testing the significance of $\beta$ is $\hat\beta/SE(\hat\beta) = 0.75/0.256 = 2.93$ and the $F$-statistic obtained from the analysis of variance Table 3.4 is the square of the $t$-statistic. Thus in the case of a simple linear regression model, the analysis-of-variance table does not give any more information. We will see in Chapter 4 that in the case of multiple regression this is not the case. What the analysis-of-variance table provides there is a test of significance for all the regression parameters together. Note that


Table 3.3 Analysis of Variance for the Simple Regression Model

Source of                                            Degrees of
Variation    Sum of Squares                          Freedom      Mean Square
x            ESS = $\hat\beta S_{xy}$                1            ESS/1
Residual     RSS = $S_{yy} - \hat\beta S_{xy}$       n - 2        RSS/(n - 2)
Total        TSS = $S_{yy}$                          n - 1

Table 3.4 Analysis of Variance for the Data in Table 3.2

Source of     Sum of      Degrees of     Mean
Variation     Squares     Freedom        Square
x             15.75       1              15.75
Residual      14.65       8               1.83
Total         30.4        9

$$\text{ESS} = \hat\beta S_{xy} = r^2 S_{yy}$$

and

$$\text{RSS} = S_{yy} - \hat\beta S_{xy} = (1 - r^2) S_{yy}$$

Hence the $F$-statistic can also be written as

$$F = \frac{r^2}{(1 - r^2)/(n - 2)} = \frac{(n - 2)r^2}{1 - r^2}$$

Since $F = t^2$ we get

$$r^2 = \frac{t^2}{t^2 + (n - 2)}$$

We can check in our illustrative example that $t^2 = 8.6$ and $(n - 2) = 8$. Hence

$$r^2 = \frac{8.6}{8.6 + 8} = \frac{8.6}{16.6} = 0.52$$

as we obtained earlier. The formula

$$r^2 = \frac{t^2}{t^2 + (n - 2)}$$

gives the relationship between the $t$-ratio for testing the hypothesis $\beta = 0$ and the $r^2$. We will derive a similar formula in Chapter 4.
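The analysis-of-variance arithmetic and the link between $F$, $t^2$, and $r^2$ can be verified numerically. The sketch assumes the Table 3.2 summary statistics $S_{xy} = 21$ and $S_{yy} = 30.4$ (the former is our reconstruction from $\text{ESS} = \hat\beta S_{xy} = 15.75$).

```python
n = 10
beta_hat = 0.75
S_xy, S_yy = 21.0, 30.4          # S_xy = 21 is implied by ESS = 15.75 and beta_hat = 0.75

ess = beta_hat * S_xy            # explained sum of squares, 1 d.f.
rss = S_yy - ess                 # residual sum of squares, n - 2 d.f.

f_stat = (ess / 1) / (rss / (n - 2))

r2 = ess / S_yy
f_from_r2 = (n - 2) * r2 / (1 - r2)           # F written in terms of r^2
r2_from_f = f_stat / (f_stat + (n - 2))       # r^2 = t^2 / (t^2 + n - 2) with t^2 = F

print(f"ESS = {ess:.2f}, RSS = {rss:.2f}, F = {f_stat:.2f}")
print(f"r^2 = {r2:.3f}, recovered from F: {r2_from_f:.3f}")
```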


3.7 Prediction with the Simple 
Regression Model 


The estimated regression equation $\hat{y} = \hat\alpha + \hat\beta x$ is used for predicting the value of $y$ for given values of $x$ and the estimated equation $\hat{x} = \hat\alpha' + \hat\beta' y$ is used for predicting the values of $x$ for given values of $y$. We will illustrate the procedures with reference to the prediction of $y$ given $x$.

Let $x_0$ be the given value of $x$. Then we predict the corresponding value $y_0$ of $y$ by

$$\hat{y}_0 = \hat\alpha + \hat\beta x_0 \qquad (3.11)$$

The true value $y_0$ is given by

$$y_0 = \alpha + \beta x_0 + u_0$$

where $u_0$ is the error term.

Hence the prediction error is

$$\hat{y}_0 - y_0 = (\hat\alpha - \alpha) + (\hat\beta - \beta)x_0 - u_0$$

Since $E(\hat\alpha - \alpha) = 0$, $E(\hat\beta - \beta) = 0$, and $E(u_0) = 0$ we have

$$E(\hat{y}_0 - y_0) = 0$$

This equation shows that the predictor given by equation (3.11) is unbiased. Note that the predictor is unbiased in the sense that $E(\hat{y}_0) = E(y_0)$ since both $\hat{y}_0$ and $y_0$ are random variables.

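The variance of the prediction error is

$$V(\hat{y}_0 - y_0) = V(\hat\alpha - \alpha) + x_0^2 V(\hat\beta - \beta) + 2x_0\,\text{cov}(\hat\alpha - \alpha, \hat\beta - \beta) + V(u_0)$$

$$= \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right) + \sigma^2\frac{x_0^2}{S_{xx}} - 2x_0\sigma^2\frac{\bar{x}}{S_{xx}} + \sigma^2$$

$$= \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

Thus the variance increases the farther away the value of $x_0$ is from $\bar{x}$, the mean of the observations on the basis of which $\hat\alpha$ and $\hat\beta$ have been computed. This is illustrated in Figure 3.5, which shows the confidence bands for $y$.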

Figure 3.5. Confidence bands for prediction of $y$ given $x$. [Confidence bands shown are $\hat{y} \pm Z$, where $Z = t\,SE(\hat{y})$ and $t$ is the $t$-value used.]

If $x_0$ lies within the range of the sample observations on $x$, we can call it within-sample prediction, and if $x_0$ lies outside the range of the sample observations, we call the prediction out-of-sample prediction. As an illustration, consider a consumption function estimated on the basis of 12 observations. The equation is (this is just a hypothetical example)

$$\hat{y} = 10.0 + 0.90x$$

where $y$ = consumer expenditures
      $x$ = disposable income

We are given $\hat\sigma^2 = 0.01$, $\bar{x} = 200$, and $S_{xx} = 4000$. Given $x_0 = 250$, our prediction of $y_0$ is

$$\hat{y}_0 = 10.0 + 0.9(250) = 235$$

The standard error of the prediction is

$$SE(\hat{y}_0) = \sqrt{0.01\left[1 + \frac{1}{12} + \frac{(250 - 200)^2}{4000}\right]} = 0.131$$

Since $t = 2.228$ from the $t$-tables with 10 degrees of freedom, the 95% confidence interval for $y_0$ is $235 \pm 2.228(0.131) = 235 \pm 0.29$, that is, $(234.71, 235.29)$.
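The way the prediction interval widens as $x_0$ moves away from $\bar{x}$ can be seen by evaluating the variance formula at several points. The sketch below uses the hypothetical consumption function from the text; the function names and the extra values of $x_0$ are ours.

```python
import math


def prediction_se(sigma2_hat, n, x0, x_bar, S_xx):
    """Standard error of the prediction error for an individual y0."""
    return math.sqrt(sigma2_hat * (1.0 + 1.0 / n + (x0 - x_bar) ** 2 / S_xx))


# Hypothetical consumption function from the text.
alpha_hat, beta_hat = 10.0, 0.90
sigma2_hat, n, x_bar, S_xx = 0.01, 12, 200.0, 4000.0
t_crit = 2.228                    # 95% two-sided point, 10 d.f. (from the t-tables)


def prediction_interval(x0):
    y_hat = alpha_hat + beta_hat * x0
    half_width = t_crit * prediction_se(sigma2_hat, n, x0, x_bar, S_xx)
    return (y_hat - half_width, y_hat + half_width)


se_250 = prediction_se(sigma2_hat, n, 250.0, x_bar, S_xx)
lower, upper = prediction_interval(250.0)
print(f"x0 = 250: SE = {se_250:.3f}, interval = ({lower:.2f}, {upper:.2f})")

# The band is narrowest at x_bar and widens on either side of it.
band = {x0: prediction_se(sigma2_hat, n, x0, x_bar, S_xx) for x0 in (200, 225, 250, 300)}
for x0, se in band.items():
    print(f"x0 = {x0}: SE = {se:.3f}")
```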


Prediction of Expected Values 

Sometimes given $x_0$, it is not $y_0$ but $E(y_0)$ that we would be interested in predicting; that is, we are interested in the mean of $y_0$, not $y_0$ as such. We will give an illustrative example of this. Since $E(y_0) = \alpha + \beta x_0$, we would be predicting this by $\hat{E}(y_0) = \hat\alpha + \hat\beta x_0$, which is the same as the $\hat{y}_0$ that we considered earlier. Thus our prediction would be the same whether it is $y_0$ or $E(y_0)$ that we want to predict. However, the prediction error would be different and the variance of the prediction error would be different (actually smaller). This means that the confidence intervals we generate would be different. The prediction error now is

$$\hat{E}(y_0) - E(y_0) = (\hat\alpha - \alpha) + (\hat\beta - \beta)x_0$$

Note that this is the same as the prediction error $\hat{y}_0 - y_0$ with $-u_0$ missing. Since the variance of $u_0$ is $\sigma^2$, we have to subtract $\sigma^2$ from the error variance we obtained earlier. Thus

$$\text{var}[\hat{E}(y_0) - E(y_0)] = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

The standard error of the prediction is given by the square root of this expression after substituting $\hat\sigma^2$ for $\sigma^2$. The confidence intervals are given by $\hat{E}(y_0) \pm t\,SE$, where $t$ is the appropriate $t$-value.


Illustrative Example 

Consider the sample of the athletic sportswear store considered in Section 3.3 earlier. The regression equation estimated from the data from 5 months presented there is

$$\hat{y} = 1.0 + 1.2x \qquad \bar{x} = 3.0 \qquad \text{RSS} = 8.8$$

Suppose that the sales manager wants us to predict what the sales revenue will be if advertising expenditures are increased to $600. He would also like a 90% confidence interval for his prediction.

We have $x_0 = 6$. Hence

$$\hat{y}_0 = 1.0 + 1.2(6) = 8.2$$

The variance of the prediction error is

$$\sigma^2\left[1 + \frac{1}{5} + \frac{(6 - 3)^2}{10}\right] = 2.1\sigma^2$$

Since

$$\hat\sigma^2 = \frac{\text{RSS}}{\text{d.f.}} = \frac{8.8}{3} = 2.93$$

the standard error of the prediction is

$$\sqrt{2.1(2.93)} = \sqrt{6.15} = 2.48$$

The 5% point from the $t$-distribution with 3 d.f. is 2.353. The 90% confidence interval for $y_0$ given $x_0 = 6$ is, therefore,

$$8.2 \pm 2.48(2.353) = (2.36, 14.04)$$

Thus the 90% confidence interval for sales revenue if advertising expenditures are $600 is ($2360; $14,040).

Consider now the case where the sales manager wants us to predict the average sales per month over the next two years when advertising expenditures are $600 per month. He also wants a 90% confidence interval for the prediction.

Now we are interested in predicting $E(y_0)$, not $y_0$. The prediction is still given by $1.0 + 1.2(6) = 8.2$. The variance of the prediction error is now

$$\sigma^2\left[\frac{1}{5} + \frac{(6 - 3)^2}{10}\right] = 1.1\sigma^2$$

Substituting $\hat\sigma^2 = 2.93$ as before and taking the square root, we now get the standard error as

$$SE[\hat{E}(y_0)] = 1.795$$

The 90% confidence interval now is

$$8.20 \pm 2.353(1.795) = 8.20 \pm 4.22$$

Thus the 90% confidence interval for the average sales is ($3980; $12,420). Note that this confidence interval is narrower than the one we obtained for $y_0$.
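The two intervals in this example differ only in whether the "1 +" term (the variance of $u_0$) is included. A minimal sketch, assuming $n = 5$ and $S_{xx} = 10$ as implied by the variance expressions above; the `for_mean` flag and function name are ours.

```python
import math


def prediction_se(sigma2_hat, n, x0, x_bar, S_xx, for_mean=False):
    """SE of the prediction error for y0 (default) or for E(y0) (for_mean=True)."""
    factor = 1.0 / n + (x0 - x_bar) ** 2 / S_xx
    if not for_mean:
        factor += 1.0            # add the variance of the future error term u0
    return math.sqrt(sigma2_hat * factor)


# Sportswear store example: 5 monthly observations.
n, x_bar, S_xx, rss = 5, 3.0, 10.0, 8.8
sigma2_hat = rss / (n - 2)
x0 = 6.0
y_hat = 1.0 + 1.2 * x0
t_crit = 2.353                   # 5% one-tail point, 3 d.f. (from the t-tables)

se_individual = prediction_se(sigma2_hat, n, x0, x_bar, S_xx)
se_mean = prediction_se(sigma2_hat, n, x0, x_bar, S_xx, for_mean=True)

ci_individual = (y_hat - t_crit * se_individual, y_hat + t_crit * se_individual)
ci_mean = (y_hat - t_crit * se_mean, y_hat + t_crit * se_mean)

print(f"individual y0: SE = {se_individual:.3f}, 90% interval = {ci_individual}")
print(f"mean E(y0):    SE = {se_mean:.3f}, 90% interval = {ci_mean}")
```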


3.8 Outliers 


Very often it happens that the estimates of the regression parameters are influenced by a few extreme observations or outliers. This problem can be detected if we study the residuals from the estimated regression equation. Actually, a detailed analysis of residuals should accompany every estimated equation. Such analysis will enable us to see whether we are justified in making the assumptions that:

1. The error variance $V(u_i) = \sigma^2$ for all $i$. This problem is treated in Chapter 5.

2. The error terms are serially independent. This problem is treated in Chapter 6.

3. The regression relationship is linear. This problem is treated in Section 3.9.

Table 3.5 Four Data Sets That Give an Identical Regression Equation

Data Set:       1-3       1       2       3       4       4
Variable:         x       y       y       y       x       y
Observation
     1         10.0    8.04    9.14    7.46     8.0    6.58
     2          8.0    6.95    8.14    6.77     8.0    5.76
     3         13.0    7.58    8.74   12.74     8.0    7.71
     4          9.0    8.81    8.77    7.11     8.0    8.84
     5         11.0    8.33    9.26    7.81     8.0    8.47
     6         14.0    9.96    8.10    8.84     8.0    7.04
     7          6.0    7.24    6.13    6.08     8.0    5.25
     8          4.0    4.26    3.10    5.39    19.0   12.50
     9         12.0   10.84    9.13    8.15     8.0    5.56
    10          7.0    4.82    7.26    6.42     8.0    7.91
    11          5.0    5.68    4.74    5.73     8.0    6.89


What we will be concerned with here is detecting some outlying observations using analysis of residuals. A more detailed discussion of analysis of residuals is given in Chapter 12. Actually, what we are doing is a diagnostic checking of our patient (regression equation) to see whether anything is wrong.

An outlier is an observation that is far removed from the rest of the observations. This observation is usually generated by some unusual factors. However, when we use the least squares method this single observation can produce substantial changes in the estimated regression equation. In the case of a simple regression we can detect outliers simply by plotting the data. However, in the case of multiple regression such plotting would not be possible and we have to analyze the residuals.

A good example to show that a simple presentation of the regression equation with the associated standard errors and $r^2$ does not give us the whole picture is given by Anscombe.5 There are four data sets presented in Table 3.5. The values of $x$ for the first three data sets are the same. All four data sets give the same regression equation.

5 F. J. Anscombe, “Graphs in Statistical Analysis,” The American Statistician, Vol. 27, No. 1, February 1973, pp. 17-21.

For all of the four data sets, we have the following statistics:

$$n = 11 \qquad \bar{x} = 9.0 \qquad \bar{y} = 7.5$$

$$S_{xx} = 110.0 \qquad S_{yy} = 41.25 \qquad S_{xy} = 55.0$$

The regression equation is

$$\hat{y} = 3.0 + \underset{(0.118)}{0.5}\,x \qquad r^2 = 0.667$$

regression sum of squares = 27.50 (1 d.f.)

residual sum of squares = 13.75 (9 d.f.)

Although the regression equations are identical, the four data sets exhibit widely different characteristics. This is revealed when we see the plots of the four data sets. These plots are shown in Figure 3.6.
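That the four data sets in Table 3.5 give (to rounding) the same fitted line can be confirmed directly. The sketch below is a plain least-squares calculation on the tabulated values; the helper `simple_ols` is ours, and the last lines refit data set 3 without observation 3 purely as an illustration of the outlier's effect.

```python
def simple_ols(x, y):
    """Return (intercept, slope, r^2) for the regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    S_xx = sum((xi - x_bar) ** 2 for xi in x)
    S_yy = sum((yi - y_bar) ** 2 for yi in y)
    S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = S_xy / S_xx
    return y_bar - slope * x_bar, slope, S_xy ** 2 / (S_xx * S_yy)


# Table 3.5 (Anscombe's four data sets).
x123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

data_sets = {1: (x123, y1), 2: (x123, y2), 3: (x123, y3), 4: (x4, y4)}
results = {k: simple_ols(x, y) for k, (x, y) in data_sets.items()}
for k, (a, b, r2) in results.items():
    print(f"data set {k}: y = {a:.2f} + {b:.3f}x, r^2 = {r2:.3f}")

# Data set 3 refitted without the outlier (observation 3).
x3_trim = x123[:2] + x123[3:]
y3_trim = y3[:2] + y3[3:]
a3, b3, r2_3 = simple_ols(x3_trim, y3_trim)
print(f"data set 3 without observation 3: y = {a3:.2f} + {b3:.3f}x")
```

Identical summary statistics therefore say nothing about whether the linear model is appropriate; only the plots (or the residuals) reveal that.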



Figure 3.6 Identical regression lines for four different data sets: (i) Data Set 1, (ii) Data Set 2, (iii) Data Set 3, (iv) Data Set 4. [Each panel plots y against x.]

Table 3.6 Per Capita Personal Consumption Expenditures (C) and Per Capita Disposable Personal Income (Y) (in 1972 Dollars) for the United States, 1929-1984

Year      C      Y        Year      C      Y        Year      C      Y
1929   1765   1883        1957   2416   2660        1977   3924   4280
1933   1356   1349        1958   2400   2645        1978   4057   4441
1939   1678   1754        1959   2487   2709        1979   4121   4512
1940   1740   1847        1960   2501   2709        1980   4093   4487
1941   [illegible]  2083  1961   2511   2742        1981   4131   4561
1942   1788   2354        1962   2583   2813        1982   4146   4555
1943   1815   2429        1963   2644   2865        1983   4303   4670
1944   1844   2483        1964   2751   3026        1984   4490   4941
1945   1936   2416        1965   2868   3171
1946   2129   2353        1966   2979   3290
1947   2122   2212        1967   3032   3389
1948   2129   2290        1968   3160   3493
1949   2140   2257        1969   3245   3564
1950   2224   2392        1970   3277   3665
1951   2214   2415        1971   3355   3752
1952   2230   2441        1972   3511   3860
1953   2277   2501        1973   3623   4080
1954   2278   2483        1974   3566   4009
1955   2384   2582        1975   3609   4051
1956   2410   2653        1976   3774   4158

Source: Economic Report of the President, 1984, p. 261.


Data set 1 shown in Figure 3.6(i) shows no special problems. Figure 3.6(ii) shows that the regression line should not be linear. Figure 3.6(iii) shows how a single outlier has twisted the regression line slightly. If we omit this one observation (observation 3), we would obtain a slightly different regression line. Figure 3.6(iv) shows how an outlier can produce an entirely different picture. If that one observation (observation 8) is omitted, we would have a vertical regression line.

We have shown graphically how outliers can produce drastic changes in the regression estimates. Next we give some real-world examples.


Some Illustrative Examples

Example 1 

This example consists of the estimation of the consumption function for the United States for the period 1929-1984. Table 3.6 gives the data on per capita disposable income ($Y$) and per capita consumption expenditures ($C$), both in constant dollars, for the United States. The data are not continuous. They are for 1929, 1933, and then continuous from 1939. The continuous data for 1929-1939 can be obtained from an earlier President's Economic Report (say, for 1972). But these data are in 1958 dollars and thus we have to link the two series.6 We have not attempted this.

We will estimate a regression of $C$ on $Y$. It was estimated by the SAS regression package. The results are as follows:

$$\hat{C} = \underset{(58.124)}{-24.944} + \underset{(0.018)}{0.911}\,Y \qquad r^2 = 0.9823$$

The $t$-value for $\hat\beta$ is 0.911/0.018 = 50.47 and it can be checked that $r^2 = t^2/[t^2 + (n - 2)]$ with $n = 48$. (The figures have all been rounded to three decimals from the computer output.) The next step is to examine the residuals. These are presented in Table 3.7. One can easily notice the large negative residuals for observations 6, 7, 8, and 9. These observations are outliers. They correspond to the war years 1942-1945 with strict controls on consumer expenditures. We therefore discard these observations and reestimate the equation.

Table 3.7 Residuals for the Consumption Function Estimated from the Data in Table 3.6 (Rounded to the First Decimal)

Observation   Residual     Observation   Residual     Observation   Residual
     1           75.0          17           24.2          33           24.1
     2          152.4          18           41.6          34          -35.9
     3          105.5          19           57.4          35          -37.1
     4           82.8          20           18.8          36           20.5
     5          -46.1          21           18.4          37          -67.9
     6         -330.9          22           16.0          38          -60.2
     7         -372.2          23           44.8          39          -55.4
     8         -392.4          24           58.8          40           12.1
     9         -239.4          25           38.7          41           51.0
    10           11.0          26           46.0          42           37.4
    11          132.4          27           59.7          43           36.7
    12           68.4          28           20.1          44           31.5
    13          109.4          29            5.0          45            2.1
    14           70.5          30            7.6          46           22.5
    15           39.5          31          -29.5          47           74.8
    16           31.8          32            3.7          48           15.0

The equation now is

$$\hat{C} = \underset{(22.353)}{85.725} + \underset{(0.007)}{0.885}\,Y \qquad r^2 = 0.9975$$

Earlier the intercept was not significantly different from zero. Now it is significantly positive. Further, the estimate of the marginal propensity to consume is significantly lower (0.885 compared to 0.911).7 The estimated residuals from this equation are presented in Table 3.8. We do not see any exceptionally large residual (except perhaps for observation 5, which is 1941), nor the long runs of positive and negative residuals as in Table 3.7.

6 The data for 1929-1970 at 1958 prices have been analyzed in G. S. Maddala, Econometrics (New York: McGraw-Hill, 1977), pp. 84-86. These data are given as an exercise.

Table 3.8 Estimated Residuals for the Consumption Function Omitting the War Years (1942-1945)

Observation   Residual     Observation   Residual     Observation   Residual
     1           12.4          20          -24.2          35          -52.1
     2           76.1          21          -24.4          36            8.3
     3           39.6          22          -27.2          37          -74.5
     4           19.2          23            3.2          38          -68.6
     5         -103.7          24           17.2          39          -62.8
    10          -39.7          25           -2.0          40            7.5
    11           78.1          26            7.1          41           49.5
    12           16.1          27           22.1          42           40.0
    13           56.3          28          -13.4          43           41.1
    14           20.8          29          -24.8          44           35.2
    15           -9.6          30          -19.1          45            7.7
    16          -16.6          31          -53.8          46           28.0
    17          -22.7          32          -17.8          47           83.2
    18           -5.8          33            4.3          48           30.3
    19           12.6          34          -53.1
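The visual inspection of Table 3.7 can be mimicked with a simple screening rule. The rule used below, flagging residuals larger in absolute value than twice the standard error of the regression, is our own illustrative assumption and not a procedure given in the text; a flag is only a prompt to investigate, not a reason to delete.

```python
import math

# Residuals from Table 3.7, observations 1-48 in order.
residuals = [
    75.0, 152.4, 105.5, 82.8, -46.1, -330.9, -372.2, -392.4,
    -239.4, 11.0, 132.4, 68.4, 109.4, 70.5, 39.5, 31.8,
    24.2, 41.6, 57.4, 18.8, 18.4, 16.0, 44.8, 58.8,
    38.7, 46.0, 59.7, 20.1, 5.0, 7.6, -29.5, 3.7,
    24.1, -35.9, -37.1, 20.5, -67.9, -60.2, -55.4, 12.1,
    51.0, 37.4, 36.7, 31.5, 2.1, 22.5, 74.8, 15.0,
]

n = len(residuals)
rss = sum(e * e for e in residuals)
ser = math.sqrt(rss / (n - 2))        # standard error of the regression

# Hypothetical screening threshold: |residual| > 2 * SER.
threshold = 2 * ser
flagged = [i + 1 for i, e in enumerate(residuals) if abs(e) > threshold]

print(f"SER = {ser:.1f}, threshold = {threshold:.1f}")
print("flagged observations:", flagged)
```

With these residuals the rule picks out observations 6 to 9, the war years 1942-1945 singled out in the text.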

Example 2 

As a second example, consider the data presented in Table 3.9. They give teenage unemployment rates and unemployment benefits in Australia for the period 1962-1980. We will estimate some simple regressions of unemployment rates on unemployment benefits (in constant dollars) and present an analysis of the residuals.8

Table 3.9 Teenage Unemployment Rates and Unemployment Benefits

          Teenage Unemployment       Unemployment Benefit for
               Rate (%)                16- and 17-Year-Olds
Year      Males      Females       Nominal       Constant 1981
                                   Dollars          Dollars
1962        4.5        5.9           3.50            13.3
1963        3.4        4.5           3.50            13.2
1964        2.5        4.4           3.50            12.7
1965        4.4        5.1           3.50            12.2
1966        3.1        5.5           3.50            11.9
1967        4.1        5.7           3.50            11.6
1968        4.7        6.0           3.50            11.3
1969        5.8        6.3           4.50            14.1
1970        5.7        5.1           4.50            13.4
1971        6.9        6.1           4.50            12.5
1972        8.6        9.3           7.50            20.0
1973        7.6        7.8          23.00            54.1
1974       11.9       12.3          31.00            62.7
1975       14.6       16.7          36.00            63.8
1976       13.6       15.6          36.00            55.8
1977       17.1       18.8          36.00            51.1
1978       16.4       18.8          36.00            47.4
1979       15.9       18.9          36.00            43.1
1980       15.7       17.6          36.00            39.5

Source: R. G. Gregory and R. C. Duncan, “High Teenage Unemployment: The Role of Atypical Labor Supply Behavior,” Economic Record, Vol. 56, December 1980, pp. 316-330.

7 There is an increase in $r^2$ as well but the two $r^2$'s are not comparable. To compare the two we have to calculate the $r^2$ between $C$ and $\hat{C}$ from the first equation for just the 44 observations (excluding the war years). Then this recomputed $r^2$ will be comparable to the $r^2$ reported here.

8 These are not the equations estimated by Gregory and Duncan. We are just estimating some simple regressions which may not be very meaningful.

Defining

$y_1$ = unemployment rate for male teenagers
$y_2$ = unemployment rate for female teenagers
$x$ = unemployment benefit (constant dollars)

we get the following regressions:

$$\hat{y}_1 = \underset{(1.234)}{2.478} + \underset{(0.035)}{0.212}\,x \qquad r^2 = 0.690$$

$$\hat{y}_2 = \underset{(1.410)}{3.310} + \underset{(0.039)}{0.226}\,x \qquad r^2 = 0.660$$

The equations suggest that an increase in unemployment benefit leads to an increase in the unemployment rate. However, the residuals from the two equations presented in Table 3.10 show very large absolute residuals for 1973, 1974, and 1977-1980.

Suppose that as in Example 2, we delete these observations and reestimate 
the equations. The estimates are 
Table 3.10 Residuals from the Regressions of
Unemployment Rates on Unemployment Benefits

Year     Regression of y_1     Regression of y_2

1962          -0.80                 -0.42
1963          -1.88                 -1.80
1964          -2.67                 -1.78
1965          -0.67                 -0.97
1966          -1.90                 -0.50
1967          -0.84                 -0.23
1968          -0.18                  0.13
1969           0.33                 -0.20
1970           0.38                 -1.24
1971           1.77                 -0.04
1972           1.88                  1.47
1973          -6.34                 -7.73
1974          -3.89                 -5.20
1975          -1.42                 -1.04
1976          -0.72                 -0.33
1977           3.78                  3.93
1978           3.86                  4.77
1979           4.27                  5.84
1980           5.04                  5.35

\[
\hat{y}_1 = \underset{(0.610)}{2.156} + \underset{(0.023)}{0.203}\,x
\qquad r^2 = 0.876
\]

\[
\hat{y}_2 = \underset{(0.394)}{2.799} + \underset{(0.015)}{0.225}\,x
\qquad r^2 = 0.955
\]



The estimates of the coefficients of x did not change much. The $r^2$ values
are higher, but to compare the $r^2$ values we have to compute the implied
$r^2$ from the first equation using just these 13 observations. (The $r^2$ we
compute is for $y$ and $\hat{y}$.)

However, is the deletion of the six observations with high residuals the
correct procedure in this case? The answer is “no.” In Example 2 the deletion
of the war years was justified because the behavior of consumers was affected
by wartime controls. In this example there are no such controls. Instead, the
large residuals are due to lags in behavior. First, with an increase in the
unemployment benefit rate, there will be an increase in labor force
participation, and there will be time lags involved in this. Next there are
the time lags involved in qualifying for benefits, filing of claims, receipt
of benefits, and so on. Thus there are substantial time lags between x and
$y_1$, and between x and $y_2$. The large negative residuals in 1973 and 1974
and the positive residuals in 1977-1980 are a consequence of this. Further,
one can see a trend in these residuals from 1973 to 1980. Thus the solution to
this problem is not a deletion of observations, as in the previous example of
the consumption function, but a reformulation of the problem taking account of
labor force participation rates, time lags, and so on. This analysis is too
complicated for our purpose here. The example serves to illustrate the point
that not all “outliers” should be deleted.
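
The comparison above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce the printed results: the data are the male unemployment rates and constant-dollar benefits transcribed from the table, and the helper name `ols` is our own. It fits the line on all 19 observations, locates the largest residual, and refits after dropping 1973, 1974, and 1977-1980.

```python
# Male unemployment rate (%) and constant-dollar benefit, 1962-1980,
# transcribed from the table above (an assumption: no transcription errors).
years = list(range(1962, 1981))
y_male = [4.5, 3.4, 2.5, 4.4, 3.1, 4.1, 4.7, 5.8, 5.7, 6.9,
          8.6, 7.6, 11.9, 14.6, 13.6, 17.1, 16.4, 15.9, 15.7]
x_benefit = [13.3, 13.2, 12.7, 12.2, 11.9, 11.6, 11.3, 14.1, 13.4, 12.5,
             20.0, 54.1, 62.7, 63.8, 55.8, 51.1, 47.4, 43.1, 39.5]


def ols(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Full-sample fit and its residuals.
alpha_full, beta_full = ols(x_benefit, y_male)
residuals = {yr: yi - (alpha_full + beta_full * xi)
             for yr, xi, yi in zip(years, x_benefit, y_male)}
worst_year = max(residuals, key=lambda yr: abs(residuals[yr]))

# Refit after deleting the years with large residuals named in the text.
dropped = {1973, 1974, 1977, 1978, 1979, 1980}
keep = [i for i, yr in enumerate(years) if yr not in dropped]
alpha_sub, beta_sub = ols([x_benefit[i] for i in keep],
                          [y_male[i] for i in keep])

print(f"all 19 obs: y = {alpha_full:.3f} + {beta_full:.3f} x")
print(f"13 obs:     y = {alpha_sub:.3f} + {beta_sub:.3f} x")
print(f"largest absolute residual in {worst_year}")
```

The slope hardly moves when the six years are dropped, which is the point made in the text: deleting observations changes the fit statistics far more than it changes the estimated relationship, and it does nothing about the lag structure that produces the residual pattern.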

3.9 Alternative Functional Forms 
for Regression Equations 

We saw earlier with reference to Figure 3.6(ii) that sometimes the relationship 
between y and x can be nonlinear rather than linear. In this case we have to 
assume an appropriate functional form for the relationship. There are several 
functional forms that can be used which after some transformations of the vari¬ 
ables, can be brought into the usual linear regression framework that we have 
been discussing. 

For instance, for the data points depicted in Figure 3.7(a), where y is
increasing more slowly than x, a possible functional form is
$y = \alpha + \beta \log x$. This is called a semilog form, since it involves
the logarithm of only one of the two variables x and y. In this case if we
redefine a variable $X = \log x$, the equation becomes $y = \alpha + \beta X$.
Thus we have a linear regression model with the explained variable y and
explanatory variable $X = \log x$.

For the data points depicted in Figure 3.7(b), where y is increasing faster
than x, a possible functional form is $y = Ae^{\beta x}$. In this case we take
logs of both sides and get another kind of semilog specification:

\[ \log y = \log A + \beta x \]

If we define $Y = \log y$ and $\alpha = \log A$, we have

\[ Y = \alpha + \beta x \]

which is in the form of a linear regression equation.

Figure 3.7. Data sets for which linear regression is inappropriate:
(a) y increasing more slowly than x; (b) y increasing faster than x.

An alternative model one can use is

\[ y = A x^{\beta} \]

In this case taking logs of both sides, we get

\[ \log y = \log A + \beta \log x \]

In this case $\beta$ can be interpreted as an elasticity. Hence this form is
popular in econometric work. This is called a double-log specification since
it involves logarithms of both x and y. Now define $Y = \log y$,
$X = \log x$, and $\alpha = \log A$. We have

\[ Y = \alpha + \beta X \]

which is in the form of a linear regression equation. An illustrative example is 
given at the end of this section. 

There is, of course, a difference between those functional forms in which we
transform the variable x and those in which we transform the variable y. This
becomes clear when we introduce the error term u. For instance, when we
write the transformed equation with an additive error, which we do before we
use the least squares method, that is, we write

\[ Y = \alpha + \beta X + u \]

we are assuming that the original equation in terms of the untransformed
variables is

\[ y = A x^{\beta} e^{u} \]

that is, the error term enters exponentially and in a multiplicative fashion.
If we make the assumption of an additive error term in the original equation,
that is,

\[ y = A x^{\beta} + u \]

then there is no way of transforming the variables that would enable us to use
the simple methods of estimation described here. Estimation of this model
requires nonlinear least squares.

Some other functional forms that are useful when the data points are as
shown in Figure 3.8 are

\[ y = \alpha + \frac{\beta}{x} \]

or

\[ y = \alpha + \frac{\beta}{\sqrt{x}} \]

In the first case we define $X = 1/x$ and in the second case we define
$X = 1/\sqrt{x}$. In both cases the equation is linear in the variables after
the transformation. It is

\[ y = \alpha + \beta X \]

If $\beta > 0$, the relationship is as shown in Figure 3.8(a). If
$\beta < 0$, the relationship is as shown in Figure 3.8(b).

Some other nonlinearities can be handled by what is known as “search
procedures.” For instance, suppose that we have the regression equation

\[ y = \alpha + \frac{\beta}{x + \gamma} + u \]

The estimates of $\alpha$, $\beta$, and $\gamma$ are obtained by minimizing

\[ \sum_i \left( y_i - \alpha - \frac{\beta}{x_i + \gamma} \right)^2 \]

We can reduce this problem to one of simple least squares as follows: For
each value of $\gamma$, we define the variable $z_i = 1/(x_i + \gamma)$ and
estimate $\alpha$ and $\beta$ by minimizing

\[ \sum_i (y_i - \alpha - \beta z_i)^2 \]

We look at the residual sum of squares in each case and then choose that
value of $\gamma$ for which the residual sum of squares is minimum. The
corresponding estimates of $\alpha$ and $\beta$ are the least squares
estimates of these parameters. Here we are “searching” over different values
of $\gamma$. This search would be a convenient one only if we had some prior
notion of the range of this parameter. In any case there are convenient
nonlinear regression programs available nowadays. Our purpose here is to show
how some problems that do not appear to fall in the framework of simple
regression at first sight can be transformed into that framework by a suitable
redefinition of the variables.
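
A minimal sketch of such a search procedure follows. Everything in it is illustrative: the function names `ols_rss` and `search_gamma`, the grid of candidate values, and the noise-free data generated with $\alpha = 2$, $\beta = 5$, $\gamma = 1.5$ are assumptions made for the example, not part of the text.

```python
def ols_rss(z, y):
    """Least squares of y on z; returns (alpha, beta, residual sum of squares)."""
    n = len(z)
    zbar, ybar = sum(z) / n, sum(y) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    szy = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    beta = szy / szz
    alpha = ybar - beta * zbar
    rss = sum((yi - alpha - beta * zi) ** 2 for zi, yi in zip(z, y))
    return alpha, beta, rss


def search_gamma(x, y, grid):
    """For each candidate gamma, regress y on z = 1/(x + gamma); keep the best."""
    best = None
    for gamma in grid:
        z = [1.0 / (xi + gamma) for xi in x]
        alpha, beta, rss = ols_rss(z, y)
        if best is None or rss < best[3]:
            best = (gamma, alpha, beta, rss)
    return best


# Hypothetical noise-free data from y = 2 + 5 / (x + 1.5).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0 + 5.0 / (xi + 1.5) for xi in x]

# Prior notion of the range of gamma: 0.5 to 3.5 in steps of 0.1.
grid = [0.5 + 0.1 * k for k in range(31)]
gamma_hat, alpha_hat, beta_hat, rss_min = search_gamma(x, y, grid)
print(gamma_hat, alpha_hat, beta_hat, rss_min)
```

With real data the residual sum of squares will not fall to zero, and the grid can be refined around the best value once its neighborhood is known.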




Figure 3.8. (a) y decreases with increasing x; (b) y increases with increasing
x ($\beta < 0$, $\alpha > 0$).

Illustrative Example

In Table 3.11 we present data on output, labor input, and capital input for the
United States for the period 1929-1967. The variables are

X = index of gross national product in constant dollars

$L_1$ = labor input index (number of persons adjusted for hours of work
and educational level)

$L_2$ = persons engaged

$K_1$ = capital input index (capital stock adjusted for rates of utilization)

$K_2$ = capital stock in constant dollars

Since output depends on both labor and capital inputs we have to estimate a
regression equation of X on L and K. This is what we will illustrate in Chapter
4. For the present let us study the relationship between output and labor input,
and output and capital input. We will use $L_1$ for labor input and $K_1$ for
capital input.

From the data it is clear that X has risen faster than $L_1$ or $K_1$. Hence
we have a situation depicted in Figure 3.7(b). This suggests a semilogarithmic
form or a double-logarithmic form. The advantage with the latter is that we can
interpret the coefficients as elasticities. Moreover, even though X has been
rising faster than $L_1$ (or $K_1$), it has not been rising fast enough to
justify a regression of log X on $L_1$ (or $K_1$). We will therefore estimate
the equation in the double-logarithmic form:

\[ \log X = \alpha + \beta \log L_1 + u \]

A regression of X on $L_1$ gave the following results (figures in parentheses
are standard errors):

\[
\hat{X} = \underset{(23.55)}{-338.01} + \underset{(0.106)}{3.052}\,L_1
\qquad r^2 = 0.9573
\]

When we examined the residuals,⁹ they were positive for the first 12
observations, negative for the middle 17 observations, and positive again for
the last 10 observations. This is what we would expect if the relationship is
as shown in Figure 3.7(b) and we estimate a linear function.

A regression of log X on log $L_1$ gave the following results (figures in
parentheses are standard errors):

\[
\widehat{\log X} = \underset{(0.226)}{-5.480} + \underset{(0.042)}{2.084}\,\log L_1
\qquad r^2 = 0.9851
\]

This equation still exhibited some pattern in the residuals, although not to the
same extent as the linear form.


⁹ To economize on space, we are not presenting the residuals here. We present
residuals for some multiple regression equations in Chapter 12.
Table 3.11 Output, Labor Input, and Capital Input
for the United States, 1929-1967

Year       X       L_1      L_2       K_1       K_2

1929     189.8    173.3    44.151     87.8     888.9
1930     172.1    165.4    41.898     87.8     904.0
1931     159.1    158.2    36.948     84.0     900.2
1932     135.6    141.7    35.686     78.3     883.6
1933     132.0    141.6    35.533     76.6     851.4
1934     141.8    148.0    37.854     76.0     823.7
1935     153.9    154.4    39.014     77.7     805.3
1936     171.5    163.5    40.765     79.1     800.4
1937     183.0    172.0    42.484     80.0     805.5
1938     173.2    161.5    40.039     77.6     817.6
1939     188.5    168.6    41.443     81.4     809.8
1940     205.5    176.5    43.149     87.0     814.1
1941     236.0    192.4    46.576     96.2     830.3
1942     257.8    205.1    49.010    104.4     857.9
1943     277.5    210.1    49.695    110.0     851.4
1944     291.1    208.8    48.668    107.8     834.6
1945     284.5    202.1    47.136    102.1     819.3
1946     274.0    213.4    49.950     97.2     812.3
1947     279.9    223.6    52.350    105.9     851.3
1948     297.6    228.2    53.336    113.0     888.3
1949     297.7    221.3    51.469    114.9     934.6
1950     328.9    228.8    52.972    124.1     964.6
1951     351.4    239.0    55.101    134.5    1021.4
1952     360.4    241.7    55.385    139.7    1068.5
1953     378.9    245.2    56.226    147.4    1100.3
1954     375.8    237.4    54.387    148.9    1134.6
1955     406.7    245.9    55.718    158.6    1163.2
1956     416.3    251.6    56.770    167.1    1213.9
1957     422.8    251.5    56.809    171.9    1255.5
1958     418.4    245.1    55.023    173.1    1287.9
1959     445.7    254.9    56.215    182.5    1305.8
1960     457.3    259.6    56.743    189.0    1341.4
1961     466.3    258.1    56.211    194.1    1373.9
1962     495.3    264.6    57.078    202.3    1399.1
1963     515.5    268.5    57.540    205.4    1436.7
1964     544.1    275.4    58.508    215.9    1477.8
1965     579.2    285.3    60.055    225.0    1524.4
1966     615.6    297.4    62.130    236.2    1582.2
1967     631.1    305.0    63.162    247.9    1645.3

Source: L. R. Christensen and D. W. Jorgenson, “U.S.
Real Product and Real Factor Input 1929-67,” Review of
Income and Wealth, March 1970.
The log-linear form is easier to interpret. The elasticity of output with
respect to the labor input is about 2. This is, of course, very high, but this
is because of the omission of the capital input. In Chapter 4 we estimate the
production function with the capital input included.

In this particular example, estimation in the linear form is not very meaning¬ 
ful. A linear production function assumes perfect substitutability between cap¬ 
ital and labor inputs. However, it is given here as an example illustrating Figure 
3.7(b). 

Note that r 2 is higher in the log-linear form than in the linear form. But this 
comparison does not mean anything because the dependent variables in the 
two equations are different. (This point is discussed in greater detail in Section 
5.6 of Chapter 5.) 

Finally, in the case of a simple regression it is always easy to plot the points 
on a graph and see what functional form is the most appropriate for the prob¬ 
lem. This is not possible in the case of multiple regression with several ex¬ 
planatory variables. Hence greater use is made of the information from the 
residuals. 
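
The two regressions in the illustrative example can be approximated with a short script. This is a sketch under assumptions: the X and $L_1$ columns are transcribed from Table 3.11, "log" is taken to be the natural logarithm, and the helper `ols` is our own name. Small differences from the printed coefficients are possible.

```python
import math

# Output index X and labor input index L_1, 1929-1967, from Table 3.11.
X = [189.8, 172.1, 159.1, 135.6, 132.0, 141.8, 153.9, 171.5, 183.0, 173.2,
     188.5, 205.5, 236.0, 257.8, 277.5, 291.1, 284.5, 274.0, 279.9, 297.6,
     297.7, 328.9, 351.4, 360.4, 378.9, 375.8, 406.7, 416.3, 422.8, 418.4,
     445.7, 457.3, 466.3, 495.3, 515.5, 544.1, 579.2, 615.6, 631.1]
L1 = [173.3, 165.4, 158.2, 141.7, 141.6, 148.0, 154.4, 163.5, 172.0, 161.5,
      168.6, 176.5, 192.4, 205.1, 210.1, 208.8, 202.1, 213.4, 223.6, 228.2,
      221.3, 228.8, 239.0, 241.7, 245.2, 237.4, 245.9, 251.6, 251.5, 245.1,
      254.9, 259.6, 258.1, 264.6, 268.5, 275.4, 285.3, 297.4, 305.0]


def ols(x, y):
    """Return (intercept, slope, r_squared) for a simple regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope, sxy ** 2 / (sxx * syy)


# Linear form: X on L_1.
a_lin, b_lin, r2_lin = ols(L1, X)

# Double-log form: log X on log L_1; the slope is the elasticity.
a_log, b_log, r2_log = ols([math.log(v) for v in L1],
                           [math.log(v) for v in X])

print(f"linear:     X = {a_lin:.2f} + {b_lin:.3f} L1, r2 = {r2_lin:.4f}")
print(f"double-log: log X = {a_log:.3f} + {b_log:.3f} log L1, r2 = {r2_log:.4f}")
```

The two $r^2$ values printed by the script refer to different dependent variables (X and log X), so, as noted above, they should not be compared directly.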

*3.10 Inverse Prediction in the Least
Squares Regression Model

At times a regression model of y on x is used to make a prediction of the value 
of x which could have given rise to a new observation y. As an illustration 
suppose that we have data on sales and advertising expenditures for a number 
of firms in an industry. There is a new firm whose sales are known but adver¬ 
tising expenditures are not. In this case we would want to get an estimate of 
the advertising expenditures of this firm. 

In this example we have the estimated regression equation

\[ \hat{y} = \hat{\alpha} + \hat{\beta} x \]

and given a new value $y_0$ of y we have to estimate the value $x_0$ of x that
could have given rise to $y_0$, and obtain a confidence interval for $x_0$.

To simplify the algebra we will write $x' = x - \bar{x}$ and
$y' = y - \bar{y}$. Then the estimated regression equation can be written as

\[ y' = \hat{\beta} x' \tag{3.12} \]

We are given the value $y_0$ of y and we are asked to estimate the value $x_0$
of x that could have given rise to $y_0$. Define $y'_0 = y_0 - \bar{y}$ and
$x'_0 = x_0 - \bar{x}$. The estimate of $x'_0$ given from equation (3.12) is

\[ \hat{x}'_0 = \frac{y'_0}{\hat{\beta}} \tag{3.13} \]

*Optional section.

The main problem is that of obtaining confidence limits for $x'_0$, because
the estimate is a ratio of two variables, $y'_0$ and $\hat{\beta}$, both of
which are normally distributed. One can use a method suggested by Fieller for
this purpose.¹⁰ This method can be used to obtain confidence intervals for
ratios of estimators. These problems also arise in our discussion of
simultaneous equation models and distributed lag models in later chapters.
Another problem that falls in this same category is to estimate the value of x
at which the value of y is zero. This is given by

\[ \hat{x} = -\frac{\hat{\alpha}}{\hat{\beta}} \]

and $\hat{\alpha}$ and $\hat{\beta}$ have a joint normal distribution.

Let us define $\theta = E(y'_0)/\beta$ ($\theta$ is just $x'_0$ and is thus a
constant). Then the variable $y'_0 - \theta\hat{\beta}$ is normally distributed
with mean zero and variance

\[
\sigma^2\left(1 + \frac{1}{n}\right) + \theta^2 \frac{\sigma^2}{S_{xx}}
\tag{3.14}
\]

[since $\mathrm{var}(y'_0) = \mathrm{var}(y_0 - \bar{y}) = \sigma^2 + \sigma^2/n$
and $\mathrm{cov}(y'_0, \hat{\beta}) = 0$]. Substituting the estimate
$\hat{\sigma}^2$ of $\sigma^2$ derived earlier into (3.14), we get the
estimated variance. Thus

\[
t = \frac{y'_0 - \theta\hat{\beta}}
         {\sqrt{\hat{\sigma}^2 (1 + 1/n) + \theta^2 (\hat{\sigma}^2 / S_{xx})}}
\]

has a t-distribution with degrees of freedom $(n - 2)$. This can be used to
construct a confidence interval for $\theta$, which is what we need. For
instance, to get a 95% confidence interval for $\theta$, we find a value $t_0$
of t (from the t-tables for $n - 2$ degrees of freedom) such that

\[ \mathrm{Prob}(t^2 < t_0^2) = 0.95 \]

To get the limits for $\theta$ we solve the quadratic equation

\[
(y'_0 - \hat{\beta}\theta)^2
 - t_0^2 \hat{\sigma}^2 \left(1 + \frac{1}{n} + \frac{\theta^2}{S_{xx}}\right) = 0
\tag{3.15}
\]

The roots $\theta_1$ and $\theta_2$ of this equation give the confidence limits.

As an illustration, suppose that we have estimated a consumption function
on the basis of 10 observations and the equation is¹¹

\[ \hat{y} = 10.0 + 0.9x \]

and $\bar{x} = 200$, $\bar{y} = 185$, $\hat{\sigma}^2 = 5.0$, and
$S_{xx} = 400$. Consider the predicted value of x for $y = 235$. This is
clearly $x_0 = 250$.


¹⁰ C. E. Fieller, “A Fundamental Formula in the Statistics of Biological Assay
and Some Applications,” Quarterly Journal of Pharmacy, 1944, pp. 117-123.

¹¹ We have changed the illustrative example slightly from the one we discussed
in Section 3.7 because with $\hat{\sigma}^2 = 0.01$ and $S_{xx} = 4000$ we get
a symmetric confidence interval for $\theta$. We have changed the example to
illustrate the point that we need not get a symmetric confidence interval.
To get the confidence interval for $x_0$ we solve equation (3.15). Note that
$y'_0 = 235 - 185 = 50$ and $t_0 = 2.306$ from the t-tables for 8 degrees of
freedom. Thus we have to solve the equation

\[
(50 - 0.9\theta)^2 - (2.306)^2 (5)\left(1 + \frac{1}{10} + \frac{\theta^2}{400}\right) = 0
\]

or

\[ 2500 - 90\theta + 0.81\theta^2 = 29.25 + 0.066\theta^2 \]

or

\[ 0.744\theta^2 - 90\theta + 2470.75 = 0 \]

The two roots are $\theta_1 = 42.12$ and $\theta_2 = 78.85$. Thus the 95%
confidence interval for $x'_0$ is (42.12, 78.85) or for $x_0$ it is
(242.12, 278.85). Note that the confidence interval obtained by the Fieller
method need not be symmetric around the predicted point value (which in this
case is $x_0 = 250$).

If the samples are large, we can use the asymptotic distributions and this will 
give symmetric intervals. However, for small samples we have to use the 
method described here. 
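
The quadratic in equation (3.15) is easy to solve numerically. The sketch below is illustrative only: the function name `fieller_limits` is our own, and the inputs ($y'_0 = 50$, $\hat{\beta} = 0.9$, $\hat{\sigma}^2 = 5$, $n = 10$, $S_{xx} = 400$, $t_0 = 2.306$) are simply the figures used in the worked example above. Because the script does not round the intermediate coefficients, its roots differ slightly from the printed ones.

```python
import math


def fieller_limits(y0_dev, beta_hat, sigma2_hat, n, sxx, t0):
    """Solve (y0' - beta*theta)^2 - t0^2 * s^2 * (1 + 1/n + theta^2/Sxx) = 0.

    Returns the two roots (lower, upper), which are the confidence limits
    for theta = x0 - xbar.
    """
    k = t0 ** 2 * sigma2_hat
    a = beta_hat ** 2 - k / sxx              # coefficient of theta^2
    b = -2.0 * beta_hat * y0_dev             # coefficient of theta
    c = y0_dev ** 2 - k * (1.0 + 1.0 / n)    # constant term
    disc = b * b - 4.0 * a * c
    if a <= 0 or disc < 0:
        # The slope is too imprecisely estimated to give a finite interval.
        raise ValueError("no finite confidence interval")
    root = math.sqrt(disc)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)


lower, upper = fieller_limits(y0_dev=50.0, beta_hat=0.9, sigma2_hat=5.0,
                              n=10, sxx=400.0, t0=2.306)
xbar = 200.0
print(f"limits for x0 - xbar: ({lower:.2f}, {upper:.2f})")
print(f"limits for x0:        ({xbar + lower:.2f}, {xbar + upper:.2f})")
```

The guard on `a` reflects a general feature of the method: when $\hat{\beta}^2$ is small relative to $t_0^2 \hat{\sigma}^2 / S_{xx}$, that is, when the slope is not significantly different from zero, the quadratic does not yield a bounded interval.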

*3.11 Stochastic Regressors 


In Section 3.2 we started explaining the regression model saying that the values
$x_1, x_2, \ldots, x_n$ of x are given constants, that is, x is not a random
variable. This view is perhaps valid in experimental work but not in
econometrics. In econometrics we view x as a random variable (except in special
cases where x is time or some dummy variable). In this case the observations
$x_1, x_2, \ldots, x_n$ are realizations of the random variable x. This is what
is meant when we say that the regressor is stochastic.

Which of the results we derived earlier are valid in this case? First, regarding 
the unbiasedness property of the least squares estimators, that property holds 
good provided that the random variables x and u are independent. (This is 
proved in the appendix to this chapter.) Second, regarding the variances of the 
estimators and the tests of significance that we derived, these have to be viewed 
as conditional on the realized values of x. 

To obtain any concrete results, with stochastic regressors, we need to make 
some assumptions about the joint distribution of y and x. It has been proved 
that when y and x are jointly normally distributed, the formulas we have de¬ 
rived for the estimates of the regression parameters, their standard errors, and 
the test statistics are all valid. Since the proof of this proposition is beyond the 
scope of this book, we will not pursue it here.¹²


¹² In the case of multiple regression, proof of this proposition can be found in
several papers. A clear exposition of all the propositions is given in: Allan R.
Sampson, “A Tale of Two Regressions,” Journal of the American Statistical
Association, September 1974, pp. 682-689.

Thus, with stochastic regressors the formulas we have derived are valid if y
and x are jointly normally distributed. Otherwise, they have to be considered
as valid conditional on the observed x’s.

*3.12 The Regression Fallacy 


In the introduction to this chapter we mentioned a study by Galton, who ana¬ 
lyzed the relationship between the height of children and the height of parents. 
Let 


x = mid-parent height 

y = mean height (at maturity) of all children whose 
mid-parent height is x 

Galton plotted y against x and found that the points lay close to a straight line
but the slope was less than 1.0. What this means is that if the mid-parent height
is 1 inch above $\bar{x}$, the children’s height (on the average) is less than
1 inch above $\bar{y}$. There is thus a “regression of children’s heights
toward the average.”

A phenomenon like the one observed by Galton could arise in several situations
where y and x have a bivariate normal distribution and thus is a mere
statistical artifact. That is why it is termed a “regression fallacy.” To see
this, first we have to derive the conditional distributions $f(x \mid y)$ and
$f(y \mid x)$ when x and y are jointly normal. We will show that both these
conditional distributions are normal.


The Bivariate Normal Distribution 


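Suppose that X and Y are jointly normally distributed with means, variances,
and covariance given by

\[
E(X) = m_x \qquad E(Y) = m_y \qquad V(X) = \sigma_x^2 \qquad V(Y) = \sigma_y^2
\]

and

\[ \mathrm{cov}(X, Y) = \rho \sigma_x \sigma_y \]

Then the joint density of X and Y is given by

\[
f(x, y) = \left(2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}\right)^{-1} \exp(Q)
\]

where

\[
Q = -\frac{1}{2(1 - \rho^2)}
\left[
\left(\frac{x - m_x}{\sigma_x}\right)^2
- 2\rho \left(\frac{x - m_x}{\sigma_x}\right)\left(\frac{y - m_y}{\sigma_y}\right)
+ \left(\frac{y - m_y}{\sigma_y}\right)^2
\right]
\]

Now completing the square in x and simplifying, we get

\[
Q = -\frac{1}{2(1 - \rho^2)}
\left(\frac{x - m_x}{\sigma_x} - \rho\,\frac{y - m_y}{\sigma_y}\right)^2
- \frac{1}{2}\left(\frac{y - m_y}{\sigma_y}\right)^2
\]

Thus

\[ f(x, y) = f(x \mid y)\, f(y) \]

where

\[
f(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}
\exp\left[-\frac{1}{2\sigma_y^2}(y - m_y)^2\right]
\]

and

\[
f(x \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_x \sqrt{1 - \rho^2}}
\exp\left[-\frac{1}{2\sigma_x^2(1 - \rho^2)}(x - m_{x \cdot y})^2\right]
\]

where

\[ m_{x \cdot y} = m_x + \frac{\rho \sigma_x}{\sigma_y}(y - m_y) \]

Thus we see that the marginal distribution of y is normal with mean $m_y$ and
variance $\sigma_y^2$. The conditional distribution of x given y is also normal
with

\[ \text{mean} = m_x + \frac{\rho \sigma_x}{\sigma_y}(y - m_y) \]

and

\[ \text{variance} = \sigma_x^2 (1 - \rho^2) \]

The conditional distribution of y given x is just obtained by interchanging x and
y in the foregoing relationships.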

Thus for a bivariate normal distribution, both the marginal and conditional
distributions are univariate normal.¹³ Note that the converse need not be true,
that is, if the marginal distributions of X and Y are normal, it does not
necessarily follow that the joint distribution of X and Y is bivariate normal.
In fact, there are many nonnormal bivariate distributions for which the
marginal distributions are both normal.¹⁴


¹³ This result is more general. For the multivariate normal distribution we can
show that all marginal and conditional distributions are also normal.

¹⁴ C. J. Kowalski, “Non-normal Bivariate Distributions with Normal Marginals,”
The American Statistician, Vol. 27, No. 3, June 1973, pp. 103-106; K. V.
Mardia, Families of Bivariate Distributions (London: Charles Griffin, 1970).


Galton’s Result and the Regression Fallacy

Consider now the mean of Y for a given value of X. We have seen that it is given
by

\[ E(Y \mid X = x) = m_y + \frac{\rho \sigma_y}{\sigma_x}(x - m_x) \]
The slope of this line is $\rho \sigma_y / \sigma_x$, and if
$\sigma_x = \sigma_y$, since $\rho < 1$ we have the result that the slope is
less than 1, as observed by Galton.

By the same token, if we consider $E(X \mid Y = y)$ we get

\[ E(X \mid Y = y) = m_x + \frac{\rho \sigma_x}{\sigma_y}(y - m_y) \]

Since we have assumed that $\sigma_x = \sigma_y$, the slope of this line is
also less than unity (note that we are taking $dx/dy$ as the slope in this
case). Thus if Galton had considered the conditional means of parents’ heights
for given values of offsprings’ heights, he would have found a “regression” of
parents’ heights toward the mean. It is not clear what Galton would have
labeled this regression.

Such “regression” toward average is often found when considering variables 
that are jointly normally distributed and that have almost the same variance. 
This has been a frequent finding in the case of test scores. For example, if 

x = score on the first test 


and 


y = score on the second test 

then considering the conditional means of y for given values of x shows a 
regression toward the mean in the second test. This does not mean that the 
students’ abilities are converging toward the mean. This finding in the case of 
test scores has been named a regression fallacy by the psychologist
Thorndike.¹⁵

This, then, is the story of the term “regression.” The term as it is used now 
has no implication that the slope be less than 1.0, nor even the implication of 
linearity. 
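
The regression fallacy can be seen in a small simulation. The sketch below is an illustration under stated assumptions: two standardized test scores are drawn from a bivariate normal distribution with equal variances and a correlation of 0.5 (the value of `rho`, the sample size, and the seed are arbitrary choices), and the least squares slopes of y on x and of x on y are computed with a helper called `slope`.

```python
import math
import random


def slope(x, y):
    """Least squares slope of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


random.seed(0)   # fixed seed so the run is reproducible
rho = 0.5        # assumed correlation between the two scores
n = 20000

first, second = [], []
for _ in range(n):
    z1 = random.gauss(0.0, 1.0)
    z2 = random.gauss(0.0, 1.0)
    first.append(z1)                                       # x: first test
    second.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2) # y: second test

slope_y_on_x = slope(first, second)   # estimates rho * sigma_y / sigma_x
slope_x_on_y = slope(second, first)   # estimates rho * sigma_x / sigma_y

print(f"slope of y on x: {slope_y_on_x:.3f}")
print(f"slope of x on y: {slope_x_on_y:.3f}")
```

Both slopes come out close to `rho` and hence below 1, even though nothing in the simulated data is converging toward the mean: the two scores have the same distribution by construction.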


Summary 


1. The present chapter discusses the simple linear regression model with one 
explained variable and one explanatory variable. The term regression literally 
means “backwardation,” but that is not the way it is used today, although that 
is the way it originated in statistics. A brief history of the term is given in 
Section 3.1, but it is discussed in greater detail in Section 3.12 under the title 
“regression fallacy.” 

¹⁵ R. L. Thorndike, “Regression Fallacies in the Matched Group Experiment,”
Psychometrika, Vol. 7, No. 2, 1942, pp. 85-102.

2. Two methods, the method of moments (Section 3.3) and the method of
least squares (Section 3.4), are described for the estimation of the parameters
in the linear regression model. A third method, the method of maximum
likelihood (ML), is presented in the appendix. For the normal linear regression
model all of them give the same results.

3. Given two variables y and x there is always the question of whether to
regress y on x or to regress x on y. This question can always be answered if we 
know how the data were generated or if we know the direction of causality. In 
some cases both the regressions make sense. These issues are discussed in the 
latter part of Section 3.4. 

4. In Sections 3.5 and 3.6 we discuss the procedures of obtaining 

(a) Standard errors of the coefficients. 

(b) Confidence intervals for the parameters. 

(c) Tests of hypotheses about the parameters. 

The exact formulas need not be repeated here because they are summarized in 
the respective sections. The main results that need to be presented are: 

(a) The estimates of the regression parameters with their standard errors.
Sometimes the t-ratios are given instead of the standard errors. If we are
interested in obtaining confidence intervals, the standard errors are more
convenient. If we are interested in tests of hypotheses, presentation of the
t-ratios is sometimes more convenient.

(b) The coefficient of determination $r^2$.

(c) SEE or SER. This is an estimate of the standard deviation $\sigma$ of the
error term.

However, these statistics by themselves are not sufficient. In Section 3.8 we 
give an example of four different data sets that give the same regression output. 

5. While considering the predictions from the linear regression model, it is 
important to note whether we are obtaining predictions for the particular value 
of y or for the mean value of y. Although the point prediction is the same for 
the two cases, the variance of the prediction error and the confidence intervals 
we generate will be different. This is illustrated with an example in Section 3.7. 
Sometimes we are interested in the inverse prediction: prediction of x given y. 
This problem is discussed in Section 3.10. 

6. In regression analysis it is important to examine the residuals and see 
whether there are any systematic patterns in them. Such analysis would be 
useful in detecting outliers and judging whether the linear functional form is 
appropriate. The problem of detection of outliers and what to do with them is 
discussed in Section 3.8. In Section 3.9 we discuss different functional forms 
where the least squares model can be used with some transformations of the 
data. 

7. Throughout the chapter, the explanatory variable is assumed to be fixed
(or a nonrandom variable). In Section 3.11 (optional) we discuss briefly what
happens if this assumption is relaxed.

8. The last three sections and the appendix, which contains some derivations
of the results and a discussion of the ML estimation method and the LR test,
can be omitted by beginning students.

Exercises


(More difficult exercises are marked with a *) 

1. Comment briefly on the meaning of each of the following. 

(a) Estimated coefficient. 

(b) Standard error. 

(c) t-statistic.

(d) r squared.

(e) Sum of squared residuals. 

(f) Standard error of the regression. 

(g) Best linear unbiased estimator. 

2. A store manager selling TV sets observes the following sales on 10 differ¬ 
ent days. Calculate the regression of y on x where 

y = number of TV sets sold 
x = number of sales representatives 


y     3     6    10     5    10    12     5    10    10     8

x     1     1     1     2     2     2     3     3     3     2


Present all the items mentioned in Exercise 1. 

3. The following data present experience and salary structure of University 
of Michigan economists in 1983-1984. The variables are 

y = salary (thousands of dollars) 

x = years of experience (defined as years since receiving Ph.D.) 


   y      x         y      x         y      x         y      x

 63.0    43       44.5    22       45.0    18       51.3    12
 54.3    32       43.0    21       50.7    17       50.3    12
 51.0    32       46.8    20       37.5    17       62.4    10
 39.0    30       42.4    20       61.0    16       39.3    10
 52.0    26       56.5    19       48.1    16       43.2     9
 55.0    25       55.0    19       30.0    16       40.4     7
 41.2    23       53.0    19       51.5    15       37.7     6
 47.7    22       55.0    18       40.6    13       27.7     3


Source: R. H. Frank, “Are Workers Paid Their Marginal Products?” The American 
Economic Review, September 1984. Table 1, p. 560. 


Calculate the regression of y on x. Present all the items mentioned in Ex¬ 
ercise 1. Give reasons why the regression does or does not make sense. 
Calculate the residuals to see whether there are any outliers. Would you 
discard these observations or look for other explanations? 

4. Show that the simple regression line of y against x coincides with the 
simple regression line of x against y if and only if r 2 = 1 (where r is the 
sample correlation coefficient between x and y). 
5. In the regression model $y_i = \alpha + \beta x_i + u_i$, if the sample mean
$\bar{x}$ of x is zero, show that $\mathrm{cov}(\hat{\alpha}, \hat{\beta}) = 0$,
where $\hat{\alpha}$ and $\hat{\beta}$ are the least squares estimators of
$\alpha$ and $\beta$.

6. The following are data on 

y = quit rate per 100 employees in manufacturing 
x = unemployment rate 

The data are for the United States and cover the period 1960-1972. 


Year      y      x         Year      y      x

1960     1.3    6.2        1967     2.3    3.6
1961     1.2    7.8        1968     2.5    3.3
1962     1.4    5.8        1969     2.7    3.3
1963     1.4    5.7        1970     2.1    5.6
1964     1.5    5.0        1971     1.8    6.8
1965     1.9    4.0        1972     2.2    5.6
1966     2.6    3.2


(a) Calculate a regression of y on x.

\[ y = \alpha + \beta x + u \]

(b) Construct a 95% confidence interval for $\beta$.

(c) Test the hypothesis $H_0$: $\beta = 0$ against the alternative
$\beta \neq 0$ at the 5% significance level.

(d) Construct a 90% confidence interval for $\sigma^2 = \mathrm{var}(u)$.

(e) What is likely to be wrong with the assumptions of the classical
normal linear model in this case? Discuss.

7. Let $\hat{u}_i$ be the residuals in the least squares fit of $y_i$ against
$x_i$ ($i = 1, 2, \ldots, n$). Derive the following results:

\[
\sum_{i=1}^{n} \hat{u}_i = 0 \qquad \text{and} \qquad \sum_{i=1}^{n} x_i \hat{u}_i = 0
\]

8. Given data on y and x explain what functional form you will use and how 
you will estimate the parameters if 

(a) y is a proportion and lies between 0 and 1. 

(b) x > 0 and x assumes very large values relative to y. 

(c) You are interested in estimating a constant elasticity of demand 
function. 

9. At a large state university seven undergraduate students who are majoring 
in economics were randomly selected from the population and surveyed. 
Two of the survey questions asked were: (1) What was your grade-point 
average (GPA) in the preceding term? (2) What was the average number 
of hours spent per week last term in the Orange and Brew? The Orange 
and Brew is a favorite and only watering hole on campus. Using the data 
below, estimate with ordinary least squares the equation

\[ G = \alpha + \beta H \]

where G is GPA and H is hours per week in Orange and Brew. What is
the expected sign for $\beta$? Do the data support your expectations?


                          Hours per Week in
Student     GPA, G        Orange and Brew, H

   1          3.6                 3
   2          2.2                15
   3          3.1                 8
   4          3.5                 9
   5          2.7                12
   6          2.6                12
   7          3.9                 4


10. Two variables y and x are believed to be related by the following stochas¬ 
tic equation: 


\[ y = \alpha + \beta x + u \]

where u is the usual random disturbance with zero mean and constant
variance $\sigma^2$. To check this relationship one researcher takes a sample
of size 8 and estimates $\beta$ with OLS. A second researcher takes a different
sample of size 8 and also estimates $\beta$ with OLS. The data they used and
the results they obtained are as follows:


   Researcher 1            Researcher 2

    y       x               y       x

   4.0      3              2.0      1
   4.5      3              2.5      1
   4.5      3              2.5      1
   3.5      3              1.5      1
   4.5      4             11.5     10
   4.5      4             10.5     10
   5.5      4             10.5     10
   5.0      4             11.0     10

Researcher 1:

\[
\hat{y} = \underset{(1.20)}{1.875} + \underset{(0.339)}{0.750}\,x
\qquad r^2 = 0.45 \qquad \hat{\sigma} = 0.48
\]

Researcher 2:

\[
\hat{y} = \underset{(0.27)}{1.5} + \underset{(0.038)}{0.970}\,x
\qquad r^2 = 0.99 \qquad \hat{\sigma} = 0.48
\]

Can you explain why the standard error of $\hat{\beta}$ for the first
researcher is larger than the standard error of $\hat{\beta}$ for the second
researcher?

11. Since the variance of the regression coefficient $\hat{\beta}$ varies
inversely with the variance of x, it is often suggested that we should drop all
the observations in the middle range of x and use only the extreme observations
on x in the calculation of $\hat{\beta}$. Is this a desirable procedure?

12. Suppose you are attempting to build a model that explains aggregate
savings behavior as a function of the level of interest rates. Would you rather
sample during a period of fluctuating interest rates or a period of stable
interest rates? Explain.

13. A small grocery store notices that the price it charges for oranges varies 
greatly throughout the year. In the off-season the price was as high as 60 
cents per orange, and during the peak season they had special sales where 
the price was as low as 10 cents, 20 cents, and 30 cents per orange. Below 
are six weeks of data on the quantity of oranges sold (y) and price (x): 


Oranges Sold,    Price per Orange,
  y (100s)          x (cents)
     6                 10
     4                 20
     5                 30
     4                 40
     3                 50
     1                 60


Assuming that the demand for oranges is given by the linear equation

$$y = \alpha + \beta x + u$$

estimate the parameters of this equation. Calculate a 90% confidence interval
for the quantity of oranges sold in week seven if the price is 25
cents per orange during that week.

14. In Exercise 9 we considered the relationship between grade-point average
(GPA) and weekly hours spent in the campus pub. Suppose that a freshman
economics student has been spending 15 hours a week in the Orange
and Brew during the first two weeks of class. Calculate a 90% confidence
interval for his GPA in his first quarter of college if he continues to spend
15 hours per week in the Orange and Brew. Suppose that this freshman
remains in school for four years and completes the required 12 quarters
for graduation. Calculate a 90% confidence interval for his final cumulative
GPA. Suppose that the minimum GPA required for acceptance into
most graduate schools for economics is 3.25. What are the chances that
this student will be able to go to graduate school after completing his
undergraduate studies?

15. A local night entertainment establishment in a small college town is trying
to decide whether they should increase their weekly advertising expenditures
on the campus radio station. The last six weeks of data on monthly
revenue (y) and radio advertising expenditures (x) are given in the accompanying
table. What would their predicted revenues be next week if they
spent $500 on radio commercials? Give a 90% confidence interval for next
week's revenues. Suppose that they spend $500 per week for the next 10
weeks. Give a 90% confidence interval for the average revenue over the
next 10 weeks.

Week    Revenue, y                  Radio Advertising Expenditure, x
        (thousands of dollars)      (hundreds of dollars)
 1            1.5                         1.0
 2            2.0                         2.5
 3            1.0                         0.0
 4            2.0                         3.0
 5            3.5                         4.0
 6            1.5                         2.0


16. In the model $y_i = \alpha + \beta x_i + u_i$, $i = 1, \ldots, N$, the following sample
moments have been calculated from 10 observations:

$$\sum y_i = 8 \qquad \sum x_i = 40 \qquad \sum y_i^2 = 26 \qquad \sum x_i^2 = 200 \qquad \sum x_i y_i = 20$$

(a) Calculate the predictor of $y$ for $x = 10$ and obtain a 95% confidence
interval for it.

(b) Calculate the value of $x$ that could have given rise to a value of $y = 1$
and explain how you would find a 95% confidence interval for
it.

17. (Instrumental variable method) Consider the linear regression model

$$y_i = \alpha + \beta x_i + u_i$$

One of the assumptions we have made is that $x_i$ are uncorrelated with the
errors $u_i$. If $x_i$ are correlated with $u_i$, we have to look for a variable that is
uncorrelated with $u_i$ (but correlated with $x_i$). Let us call this variable $z_i$; $z_i$
is called an instrumental variable. Note that as explained in Section 3.3,
the assumptions $E(u) = 0$ and $\text{cov}(x, u) = 0$ are replaced by

$$\sum \hat{u}_i = 0 \quad \text{and} \quad \sum x_i \hat{u}_i = 0$$

However, since $x$ and $u$ are correlated we cannot use the second condition.
But since $z$ and $u$ are uncorrelated, we use the condition $\text{cov}(z, u) = 0$.
This leads to the normal equations

$$\sum \hat{u}_i = 0 \quad \text{and} \quad \sum z_i \hat{u}_i = 0$$

The estimates of $\alpha$ and $\beta$ obtained using these two equations are called
the instrumental variable estimates. From a sample of 100 observations,
the following data are obtained:

$$\sum y_i^2 = 350 \qquad \sum x_i y_i = 150 \qquad \sum x_i^2 = 400$$

$$\sum z_i y_i = 100 \qquad \sum z_i x_i = 200 \qquad \sum z_i^2 = 400$$

$$\sum y_i = 100 \qquad \sum x_i = 100 \qquad \sum z_i = 50$$

Calculate the instrumental variables estimates of $\alpha$ and $\beta$.

Let the instrumental variable estimator of $\beta$ be denoted by $\beta^*$ and the
least squares estimator of $\beta$ be $\hat{\beta}$. Show that $\beta^* = S_{zy}/S_{zx}$ with $S_{zy}$ and $S_{zx}$
defined in a similar manner to $S_{xx}$, $S_{xy}$ and $S_{yy}$. Show that

$$\text{var}(\beta^*) = \sigma^2 \cdot \frac{S_{zz}}{S_{zx}^2}$$

Hence, $\text{var}(\beta^*) > \text{var}(\hat{\beta})$. To test whether $x$ and $u$ are correlated or not,
the following test has been suggested (Hausman's test discussed in Chapter 12).

$$t = \frac{\beta^* - \hat{\beta}}{\sqrt{\hat{V}(\beta^*) - \hat{V}(\hat{\beta})}} \sim N(0, 1)$$

Apply this test to test whether $\text{cov}(x, u) = 0$ with the data provided.
*18. (Stochastic regressors) In the linear regression model

$$y_i = \alpha + \beta x_i + u_i$$

suppose that $x_i \sim IN(0, \lambda^2)$. What is the (asymptotic) distribution of the
least squares estimator $\hat{\beta}$? Just state the result. Proof not required.

*19. Consider the regression model

$$y_t = \alpha + \beta x_t + u_t \qquad u_t \sim IN(0, 1) \qquad t = 1, 2, \ldots, T$$

Suppose that the model refers to semiannual data, but the data available
are either:

1. Annual data, that is,

$$\bar{y}_1 = y_1 + y_2 \qquad \bar{y}_2 = y_3 + y_4 \qquad \bar{y}_3 = y_5 + y_6, \ \text{etc.}$$

with $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$ defined analogously (assuming that $T$ is even), or

2. Moving average data, that is,

$$y_1^* = \frac{y_1 + y_2}{2} \qquad y_2^* = \frac{y_2 + y_3}{2}$$

with $x_1^*, x_2^*, x_3^*, \ldots$ defined analogously.

(a) What are the properties of the error term in the regression model
with each set of data: $(y_t, x_t)$, $(\bar{y}_t, \bar{x}_t)$, and $(y_t^*, x_t^*)$?

(b) How would you estimate $\beta$ in the case of annual data?

(c) How would you estimate $\beta$ in the case of moving average data?

(d) Would you rather have the annual data or moving average data?
Why?

*20. Given data on $y$ and $x$ explain how you will estimate the parameters in
the following equations by using the ordinary least squares method. Specify
the assumptions you make about the errors.

(a) $y = \alpha x^{\beta}$

(b) $y = \alpha e^{\beta x}$

(c) y =

(d) y = 

(e) y = 

(f) y = 

(g) y = 

(h) y = 


a + 0 log x 

x 

ax - 0 
1 + 

a + 0 Vx 
a 4- 


x - c 


Appendix to Chapter 3 


In the following proofs we will use the compact notation $\sum y_i$ for $\sum_{i=1}^{n} y_i$.


(1) Proof That the Least Squares Estimators Are Best 
Linear Unbiased Estimators (BLUE) 

Consider the regression model

$$y_i = \beta x_i + u_i \qquad i = 1, 2, \ldots, n$$

For simplicity we have omitted the constant term. We assume that $u_i$ are independently
distributed with mean 0 and variance $\sigma^2$. Since $x_i$ are given constants,
$E(y_i) = \beta x_i$ and $\text{var}(y_i) = \sigma^2$.

The least squares estimator of $\beta$ is

$$\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2} = \sum c_i y_i$$

where $c_i = x_i / \sum x_i^2$. Thus $\hat{\beta}$ is a linear function of the sample observations $y_i$
and hence is called a linear estimator. Also,

$$E(\hat{\beta}) = \sum c_i E(y_i) = \sum c_i x_i \beta = \beta$$

since $\sum c_i x_i = 1$. Hence $\hat{\beta}$ is an unbiased linear estimator. We have to show that this is the best
(i.e., has minimum variance among the class of linear unbiased estimators).
Consider any linear estimator

$$\tilde{\beta} = \sum d_i y_i$$

If it is unbiased, we have

$$E(\tilde{\beta}) = \sum d_i E(y_i) = \sum d_i (x_i \beta) = \beta$$

Hence we should have $\sum d_i x_i = 1$. Since $y_i$ are independent with a common
variance $\sigma^2$, we have

$$\text{var}(\tilde{\beta}) = \sum d_i^2 \sigma^2$$

We have to find $d_i$ so that this variance is minimum subject to the condition
that $\sum d_i x_i = 1$.

Hence we minimize $\sum d_i^2 - \lambda(\sum d_i x_i - 1)$, where $\lambda$ is the Lagrangean multiplier.
Differentiating with respect to $d_i$ and equating to zero, we get

$$2d_i - \lambda x_i = 0 \quad \text{or} \quad d_i = \frac{\lambda}{2} x_i$$

Multiplying both sides by $x_i$ and summing over $i$, we get

$$\sum d_i x_i = \frac{\lambda}{2} \sum x_i^2$$

But $\sum d_i x_i = 1$. Hence

$$\frac{\lambda}{2} = \frac{1}{\sum x_i^2}$$

Thus we get

$$d_i = \frac{x_i}{\sum x_i^2}$$

which are the least squares coefficients $c_i$. Thus the least squares estimator has
the minimum variance in the class of linear unbiased estimators. This minimum
variance is

$$\text{var}(\hat{\beta}) = \sum c_i^2 \sigma^2 = \sum \left( \frac{x_i}{\sum x_i^2} \right)^2 \sigma^2 = \frac{\sigma^2}{\sum x_i^2}$$

Note that what we have shown is that the least squares estimator has the minimum
variance among the class of linear unbiased estimators. It is possible in
some cases that we can find nonlinear estimators that are unbiased but have a
smaller variance than the linear estimators. However, if $u_i$ are independently
and normally distributed, then the least squares estimator has minimum variance
among all (linear and nonlinear) unbiased estimators. Proofs of these
propositions are beyond our scope and hence are omitted.
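
The minimum-variance property can be checked numerically for one competing estimator. The sketch below is only an illustration under stated assumptions: the regressor values are hypothetical, and the competitor (the ratio of means, with weights $d_i = 1/\sum x_i$) is just one convenient member of the class of linear unbiased estimators, not one discussed in the text.

```python
# Hypothetical regressor values (illustrative only, not data from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]

sum_x = sum(x)
sum_x2 = sum(v * v for v in x)

# Least squares weights c_i = x_i / sum(x_i^2).
c = [v / sum_x2 for v in x]

# A competing linear unbiased estimator: the ratio of means sum(y) / sum(x),
# whose weights are d_i = 1 / sum(x_i) for every i.
d = [1.0 / sum_x for _ in x]


def unbiasedness_sum(w):
    """Return sum(w_i * x_i); this must equal 1 for a linear unbiased estimator."""
    return sum(wi * xi for wi, xi in zip(w, x))


def variance_factor(w):
    """Return sum(w_i^2), the variance of the estimator in units of sigma^2."""
    return sum(wi * wi for wi in w)


print("sum c_i x_i =", unbiasedness_sum(c), " sum d_i x_i =", unbiasedness_sum(d))
print("var factor (least squares) =", variance_factor(c))
print("var factor (ratio of means) =", variance_factor(d))
```

Both sets of weights satisfy the unbiasedness condition, but the least squares weights give the smaller value of $\sum d_i^2$, and that value equals $1/\sum x_i^2$.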


(2) Derivation of the Sampling Distributions of the Least 
Squares Estimators 
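Consider the regression model

$$y_i = \alpha + \beta x_i + u_i \qquad u_i \sim IN(0, \sigma^2)$$

We have seen that the least squares estimators are

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

To derive the sampling distributions, we will use the following result:

If $y_1, y_2, \ldots, y_n$ are independent normal with variance $\sigma^2$, and if

$$L_1 = \sum c_i y_i \quad \text{and} \quad L_2 = \sum d_i y_i$$

are two linear functions of $y_i$, then $L_1$ and $L_2$ are jointly normally distributed with

$$\text{var}(L_1) = \sigma^2 \sum c_i^2 \qquad \text{var}(L_2) = \sigma^2 \sum d_i^2$$

and

$$\text{cov}(L_1, L_2) = \sigma^2 \sum c_i d_i$$

We now write $\hat{\beta}$ and $\hat{\alpha}$ as linear functions of $y_i$. First we write

$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum (x_i - \bar{x}) y_i - \bar{y} \sum (x_i - \bar{x}) = \sum (x_i - \bar{x}) y_i$$

The last term is zero since $\sum (x_i - \bar{x}) = 0$. Thus

$$\hat{\beta} = \frac{S_{xy}}{S_{xx}} = \sum c_i y_i$$

where $c_i = (x_i - \bar{x})/S_{xx}$. Also,

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = \frac{1}{n} \sum y_i - \bar{x} \sum c_i y_i = \sum d_i y_i$$

where

$$d_i = \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{S_{xx}}$$

Hence

$$\text{var}(\hat{\beta}) = \sigma^2 \sum c_i^2 = \frac{\sigma^2}{S_{xx}^2} \sum (x_i - \bar{x})^2 = \frac{\sigma^2}{S_{xx}}$$

$$\text{var}(\hat{\alpha}) = \sigma^2 \sum d_i^2 = \sigma^2 \sum \left[ \frac{1}{n^2} + \frac{\bar{x}^2}{S_{xx}^2}(x_i - \bar{x})^2 - \frac{2\bar{x}}{n S_{xx}}(x_i - \bar{x}) \right]$$

Since $\sum (x_i - \bar{x}) = 0$, $\sum (x_i - \bar{x})^2 = S_{xx}$, and $\sum 1/n^2 = 1/n$, this gives

$$\text{var}(\hat{\alpha}) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)$$

$$\text{cov}(\hat{\alpha}, \hat{\beta}) = \sigma^2 \sum c_i d_i = \sigma^2 \sum \frac{x_i - \bar{x}}{S_{xx}} \left[ \frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{S_{xx}} \right] = -\sigma^2 \frac{\bar{x}}{S_{xx}}$$

Also, $\hat{\alpha}$ and $\hat{\beta}$ are unbiased estimators. So $E(\hat{\alpha}) = \alpha$ and $E(\hat{\beta}) = \beta$. We can
show this by noting that

$$E(\hat{\beta}) = \sum c_i E(y_i) = \sum c_i(\alpha + \beta x_i) = \beta$$

(since $\sum c_i = 0$ and $\sum c_i x_i = 1$). Also,

$$E(\hat{\alpha}) = E(\bar{y}) - E(\hat{\beta})\bar{x} = \alpha + \beta\bar{x} - \beta\bar{x} = \alpha$$

Thus we get the following results:

1. $\hat{\alpha}$ has a normal distribution with mean $\alpha$ and variance $\sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)$.

2. $\hat{\beta}$ has a normal distribution with mean $\beta$ and variance $\sigma^2/S_{xx}$.

3. $\text{cov}(\hat{\alpha}, \hat{\beta}) = \sigma^2(-\bar{x}/S_{xx})$.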
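
The three variance and covariance results follow from sums of the weights $c_i$ and $d_i$, so they can be verified arithmetically. The sketch below is a check rather than a derivation; it uses the price values of Exercise 13 as the regressor, and the variable names are illustrative.

```python
# Regressor values: the prices (cents) from Exercise 13.
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((v - x_bar) ** 2 for v in x)

# Weights that express the estimators as linear functions of y:
# beta_hat = sum(c_i * y_i), alpha_hat = sum(d_i * y_i).
c = [(v - x_bar) / s_xx for v in x]
d = [1.0 / n - x_bar * (v - x_bar) / s_xx for v in x]

# Variances and covariance in units of sigma^2.
var_beta_factor = sum(ci * ci for ci in c)
var_alpha_factor = sum(di * di for di in d)
cov_factor = sum(ci * di for ci, di in zip(c, d))

print("var(beta_hat)/sigma^2  =", var_beta_factor, "vs 1/S_xx =", 1.0 / s_xx)
print("var(alpha_hat)/sigma^2 =", var_alpha_factor,
      "vs 1/n + x_bar^2/S_xx =", 1.0 / n + x_bar ** 2 / s_xx)
print("cov/sigma^2            =", cov_factor, "vs -x_bar/S_xx =", -x_bar / s_xx)
```

The same weights also satisfy $\sum c_i = 0$ and $\sum c_i x_i = 1$, the two facts used in the unbiasedness argument.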

(3) Distribution of RSS 

Note that the derivation of the $t$-tests in the regression model depends on the
result that $\text{RSS}/\sigma^2$ has a $\chi^2$-distribution with $(n - 2)$ d.f. and that this distribution
is independent of the distribution of $\hat{\alpha}$ and $\hat{\beta}$. The exact proof of this is
somewhat complicated.¹⁶ However, some intuitive arguments can be given as
follows.

The estimated residual is

$$\hat{u}_i = y_i - \hat{y}_i$$

Note that $\hat{y}_i = \hat{\alpha} + \hat{\beta}x_i$, and $\hat{y}_i$ and $\hat{u}_i$ are uncorrelated because

$$\sum \hat{y}_i \hat{u}_i = \hat{\alpha} \sum \hat{u}_i + \hat{\beta} \sum x_i \hat{u}_i = 0$$

by virtue of the normal equations. This proves the independence between $\hat{u}_i$
and $(\hat{\alpha}, \hat{\beta})$. Note that under the assumption of normality zero correlation implies
independence.

The next thing to note is that $\sum u_i^2/\sigma^2$ has a $\chi^2$-distribution with $n$ d.f. and
$\sum u_i^2$ can be partitioned into two components as follows:

$$\sum u_i^2 = \sum (y_i - \alpha - \beta x_i)^2 = \sum (y_i - \hat{y}_i + \hat{y}_i - \alpha - \beta x_i)^2$$

$$= \sum \hat{u}_i^2 + \sum [(\hat{\alpha} - \alpha) + (\hat{\beta} - \beta)x_i]^2 \quad \text{(the cross-product term vanishes)}$$

$$= Q_1 + Q_2 \quad \text{(say)}$$

$Q_1/\sigma^2$ has a $\chi^2$-distribution with $(n - 2)$ d.f. and $Q_2/\sigma^2$ has a $\chi^2$-distribution with
2 d.f. Presentation of a detailed proof involves more algebra than is intended
here.

¹⁶ It is easier to prove it with the use of matrix algebra for the multiple regression model. This
proof can be found in any textbook of econometrics. See, for instance, Maddala, Econometrics,
pp. 456-457.


(4) The Method of Maximum Likelihood 

The method of maximum likelihood (ML) is a very general method of estimation
that is applicable in a large variety of problems. In the linear regression
model, however, we get the same estimators as those obtained by the method
of least squares.

The model is

$$y_i = \alpha + \beta x_i + u_i \qquad u_i \sim IN(0, \sigma^2)$$

This implies that $y_i$ are independently and normally distributed with respective
means $\alpha + \beta x_i$ and a common variance $\sigma^2$. The joint density of the observations,
therefore, is

$$f(y_1, \ldots, y_n) = \prod_{i=1}^{n} \left( \frac{1}{2\pi\sigma^2} \right)^{1/2} \exp\left[ -\frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i)^2 \right]$$

This function, when considered as a function of the parameters $(\alpha, \beta, \sigma^2)$, is
called the likelihood function and is denoted by $L(\alpha, \beta, \sigma^2)$. The maximum
likelihood (ML) method of estimation suggests that we choose as our estimates
the values of the parameters that maximize this likelihood function. In many
cases it is convenient to maximize the logarithm of the likelihood function
rather than the likelihood function, and we get the same results since $\log L$ and
$L$ attain the maximum at the same point.

For the linear regression model we have

$$\log L = \sum \left[ -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y_i - \alpha - \beta x_i)^2 \right] = c - \frac{n}{2}\log\sigma^2 - \frac{Q}{2\sigma^2}$$

where $c = -(n/2)\log(2\pi)$ does not involve any parameters and

$$Q = \sum (y_i - \alpha - \beta x_i)^2$$

We will maximize $\log L$ first with respect to $\alpha$ and $\beta$ and then with respect to
$\sigma$. Note that it is only the third term in $\log L$ that involves $\alpha$ and $\beta$, and maximizing
this is the same as minimizing $Q$ (since this term has a negative sign).
Thus the ML estimators of $\alpha$ and $\beta$ are the same as the least squares estimators
$\hat{\alpha}$ and $\hat{\beta}$ we considered earlier.

Substituting $\hat{\alpha}$ for $\alpha$ and $\hat{\beta}$ for $\beta$ we get the likelihood function which is now
a function of $\sigma$ only. We now have

$$\log L(\sigma) = \text{const.} - \frac{n}{2}\log\sigma^2 - \frac{\hat{Q}}{2\sigma^2} = \text{const.} - n\log\sigma - \frac{\hat{Q}}{2\sigma^2}$$

where $\hat{Q} = \sum (y_i - \hat{\alpha} - \hat{\beta}x_i)^2$ is nothing but the residual sum of squares RSS.
Differentiating this with respect to $\sigma$ and equating the derivative to zero, we
get

$$-\frac{n}{\sigma} + \frac{\hat{Q}}{\sigma^3} = 0$$

This gives the ML estimator for $\sigma^2$ as

$$\hat{\sigma}^2 = \frac{\hat{Q}}{n} = \frac{\text{RSS}}{n}$$

Note that this is different from the unbiased estimator for $\sigma^2$ which we have
been using, namely, $\text{RSS}/(n - 2)$. For large $n$ the estimates obtained by the
two methods will be very close. The ML method is a large-sample estimation
method.

Substituting $\hat{\sigma}^2 = \hat{Q}/n$ in $\log L(\sigma)$ we get the maximum value of the log-likelihood as

$$\max \log L = c - \frac{n}{2}\log\frac{\hat{Q}}{n} - \frac{n}{2} = c - \frac{n}{2}\log\hat{Q} + \frac{n}{2}\log n - \frac{n}{2}$$

This expression is useful for deriving tests for $\alpha$, $\beta$, $\sigma$ using the ML method.
Since we do not change the sample size $n$, we will just write

$$\max \log L = \text{const.} - \frac{n}{2}\log\hat{Q}$$

or

$$\max L = \text{const.}\,(\hat{Q})^{-n/2} = \text{const.}\,(\text{RSS})^{-n/2}$$
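
As a numerical check on these formulas, the sketch below fits the line by least squares to the orange data of Exercise 13, forms the ML and unbiased variance estimates, and compares the maximized log-likelihood computed directly with the concentrated expression $c - (n/2)\log(\hat{Q}/n) - n/2$. The function and variable names are illustrative.

```python
import math

# Data from Exercise 13: oranges sold (100s) and price per orange (cents).
y = [6.0, 4.0, 5.0, 4.0, 3.0, 1.0]
x = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
n = len(y)

x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# ML estimates of alpha and beta coincide with the least squares estimates.
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar
rss = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))

sigma2_ml = rss / n              # ML estimator of sigma^2
sigma2_unbiased = rss / (n - 2)  # unbiased estimator used for the t-tests


def log_likelihood(alpha, beta, sigma2):
    """Log-likelihood of the normal linear regression model."""
    q = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    return -(n / 2) * math.log(2 * math.pi * sigma2) - q / (2 * sigma2)


direct = log_likelihood(alpha_hat, beta_hat, sigma2_ml)
c = -(n / 2) * math.log(2 * math.pi)
concentrated = c - (n / 2) * math.log(rss / n) - n / 2

print("alpha_hat =", alpha_hat, " beta_hat =", beta_hat)
print("sigma2 (ML) =", sigma2_ml, " sigma2 (unbiased) =", sigma2_unbiased)
print("max log L: direct =", direct, " concentrated formula =", concentrated)
```

With only six observations the two variance estimates differ noticeably, which is consistent with the remark that ML is a large-sample method.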


(5) The Likelihood Ratio Test 

The likelihood ratio (LR) test is a general large-sample test based on the ML
method. Let $\theta$ be the set of parameters in the model and $L(\theta)$ the likelihood
function. In our example $\theta$ consists of the three parameters $\alpha$, $\beta$, and $\sigma$. Hypotheses
such as $\beta = 0$ or $\beta = 1$, $\sigma = 0$ impose restrictions on these parameters.
What the likelihood ratio test says is that we first obtain the maximum of
$L(\theta)$ without any restrictions and with the restrictions imposed by the hypothesis
to be tested. We then consider the ratio

$$\lambda = \frac{\max L(\theta) \text{ under the restrictions}}{\max L(\theta) \text{ without the restrictions}}$$

$\lambda$ cannot exceed 1 since the restricted maximum cannot be greater than
the unrestricted maximum. If the restrictions are not valid, $\lambda$ will be significantly
less than 1. If they are valid, $\lambda$ will be close to 1. The LR test consists
of using $-2\log_e\lambda$ as a $\chi^2$ with d.f. $k$, where $k$ is the number of restrictions.
Note that it is log to the base $e$ (natural logarithm).

In our least squares model, suppose that we want to test the hypothesis $\beta = 0$.
What we do is we obtain

URSS = unrestricted residual sum of squares

and

RRSS = restricted residual sum of squares

As derived in the preceding section, we have the unrestricted maximum of the
likelihood function $= C(\text{URSS})^{-n/2}$ and the restricted maximum $= C(\text{RRSS})^{-n/2}$.
Thus

$$\lambda = \left( \frac{\text{RRSS}}{\text{URSS}} \right)^{-n/2}$$

Hence

$$-2\log_e\lambda = n(\log_e \text{RRSS} - \log_e \text{URSS})$$

and we use this as a $\chi^2$ with 1 d.f. In the case of our simple regression model,
this test might sound complicated. But the point is that if we want to test the
hypothesis $\beta = 0$, note that

$$\text{RRSS} = S_{yy}$$

$$\text{URSS} = S_{yy}(1 - r^2)$$

Hence $-2\log_e\lambda = -n\log_e(1 - r^2) = n\log_e[1/(1 - r^2)]$. This we use as $\chi^2$
with 1 d.f. Of course, in the simple regression model we would not be using
this test, but the LR test is applicable in a very wide class of situations and is
used in nonlinear models where small sample tests are not available.
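
The equivalence of the two forms of the LR statistic for $\beta = 0$ can be seen on the GPA data of Exercise 9. The sketch below is only an arithmetic illustration; with seven observations the large-sample $\chi^2$ approximation should not be taken seriously, and the variable names are illustrative.

```python
import math

# Data from Exercise 9: GPA (G) and weekly hours in the Orange and Brew (H).
g = [3.6, 2.2, 3.1, 3.5, 2.7, 2.6, 3.9]
h = [3.0, 15.0, 8.0, 9.0, 12.0, 12.0, 4.0]
n = len(g)

g_bar = sum(g) / n
h_bar = sum(h) / n
s_yy = sum((gi - g_bar) ** 2 for gi in g)
s_xx = sum((hi - h_bar) ** 2 for hi in h)
s_xy = sum((hi - h_bar) * (gi - g_bar) for hi, gi in zip(h, g))

r2 = s_xy ** 2 / (s_xx * s_yy)

rrss = s_yy                # restricted RSS (beta = 0): only a constant is fitted
urss = s_yy * (1.0 - r2)   # unrestricted RSS from the fitted regression

lr_from_rss = n * (math.log(rrss) - math.log(urss))
lr_from_r2 = n * math.log(1.0 / (1.0 - r2))

print("r^2 =", r2)
print("LR from residual sums of squares =", lr_from_rss)
print("LR from r^2                      =", lr_from_r2)
```

Both expressions give the same number, which would be compared with a $\chi^2$ critical value with 1 d.f.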


(6) The Wald and Lagrangian Multiplier Tests 

There are two other commonly used large sample tests that are based on the
ML method: the W (Wald) test and the LM (Lagrangian multiplier) test. We
will derive the expressions for these test statistics in the case of the simple
regression model.

Note that the $t$-test for the hypothesis $\beta = 0$ is based on the statistic

$$t = \frac{\hat{\beta}}{\text{SE}(\hat{\beta})}$$

and in deriving $\text{SE}(\hat{\beta})$ we use an unbiased estimator for $\sigma^2$. Instead, suppose
that we use the ML estimator $\hat{\sigma}^2 = \text{RSS}/n$ that we derived earlier. Then we get
the Wald test. Note that since this is a large sample test we use the standard
normal distribution, or the $\chi^2$-distribution if we consider the squared test statistic.
Thus

$$W = \frac{\hat{\beta}^2}{\text{estimate of var}(\hat{\beta})}$$

which we use as $\chi^2$ with 1 d.f. The estimate of $\text{var}(\hat{\beta})$ is $\hat{\sigma}^2/S_{xx}$, where

$$\hat{\sigma}^2 = \frac{\text{RSS}}{n} = \frac{S_{yy}(1 - r^2)}{n}$$

Noting that $\hat{\beta} = S_{xy}/S_{xx}$ we get, on simplification,

$$W = \frac{nr^2}{1 - r^2}$$

For the LM test we use the restricted residual sum of squares (i.e., residual
sum of squares with $\beta = 0$). This is nothing but $S_{yy}$ and $\hat{\sigma}^2 = S_{yy}/n$. Thus, the
LM test statistic is

$$\text{LM} = nr^2$$

which again has a $\chi^2$-distribution with 1 d.f.

In summary, we have

$$\text{LR} = n\log\left( \frac{1}{1 - r^2} \right)$$

$$W = \frac{nr^2}{1 - r^2}$$

$$\text{LM} = nr^2$$

Each has a $\chi^2$-distribution with 1 d.f. These simple formulae are useful to remember
and will be used in subsequent chapters. There is an interesting relationship
between these test statistics that is valid for linear regression models.
This is

$$W \geq \text{LR} \geq \text{LM}$$

Note that $W/n = r^2/(1 - r^2)$. Hence, $\text{LM}/n = r^2 = (W/n)/(1 + W/n)$. Also,
$\text{LR}/n = \log(1/(1 - r^2)) = \log(1 + W/n)$. For $x > 0$, there is a famous inequality

$$x \geq \log_e(1 + x) \geq \frac{x}{1 + x}$$

Substituting $x = W/n$ we get

$$\frac{W}{n} \geq \frac{\text{LR}}{n} \geq \frac{\text{LM}}{n}$$

or

$$W \geq \text{LR} \geq \text{LM}$$

What this suggests is that a hypothesis can be rejected by the W test but not
rejected by the LM test. An example is provided in Section 4.12.
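
The three summary formulas depend only on $n$ and $r^2$, so the ordering is easy to tabulate. The sketch below uses an illustrative helper name (`lr_w_lm`) and arbitrary values of $n$ and $r^2$ chosen for illustration; it is a check of the algebra, not part of the text's derivation.

```python
import math


def lr_w_lm(n, r2):
    """Return (W, LR, LM) for testing beta = 0 in the simple regression model."""
    w = n * r2 / (1.0 - r2)
    lr = n * math.log(1.0 / (1.0 - r2))
    lm = n * r2
    return w, lr, lm


# Arbitrary illustrative sample size and squared correlations.
n = 30
for r2 in (0.05, 0.10, 0.15, 0.50):
    w, lr, lm = lr_w_lm(n, r2)
    print(f"r^2 = {r2:.2f}:  W = {w:.3f}  LR = {lr:.3f}  LM = {lm:.3f}")
```

For small $r^2$ the three statistics are close together; the gap widens as $r^2$ grows, which is where the tests can lead to conflicting conclusions at a given significance level.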

The LR test was suggested by Neyman and Pearson in 1928. The W test was
suggested by Abraham Wald in 1943. The LM test was suggested by C. R. Rao
in 1948 but the name Lagrangian Multiplier test was given in 1959 by S. D.
Silvey. The test should more appropriately be called "Rao's Score Test" but
since the "LM test" terminology is more common in econometrics, we shall
use it here. The inequality between the test statistics was first pointed out by
Berndt and Savin in 1977.

For the LR test we need the ML estimates from both the restricted and unrestricted
maximization of the likelihood function. For the W test we need only
the unrestricted ML estimates. For the LM test we need only the restricted
ML estimates. Since the last estimates are the easiest to obtain, the LM test is
very popular in econometric work.¹⁷

(7) Intuition Behind the LR, W, and LM Tests 

The three tests described here are all based on ML estimation. Before we discuss
their interrelationships, we present a few results in the theory of ML estimation.

1. $\partial \log L/\partial\theta$ is called the score function and is denoted by $S(\theta)$. The ML
estimator $\hat{\theta}$ of $\theta$ is obtained by solving $S(\theta) = 0$.

2. The quantity $E[(-\partial^2 \log L)/\partial\theta^2]$ is called the information on $\theta$ in the sample
and is denoted by $I(\theta)$. Intuitively speaking, the second derivative
measures the curvature of the function (in this case the likelihood function).
The sharper the peak is, the more the information in the sample on
$\theta$. If the likelihood function is relatively flat at the top, this means that
many values of $\theta$ are all almost equally likely; that is, there is little information
on $\theta$ in the sample.

3. The expression $I(\theta)$ plays a central role in the theory of ML estimation.
It has been proved that under fairly general conditions, the ML estimator
$\hat{\theta}$ is consistent and asymptotically normally distributed with variance
$[I(\theta)]^{-1}$. This quantity is also called the information limit to the variance
or the Cramer-Rao lower bound for the variance of the estimator $\hat{\theta}$. It is
called the lower bound because it has been shown that the variance of
any other consistent estimator is not less than this; that is, the ML estimator
has the least variance.

4. In practice, we estimate $I(\theta)$ by $(-\partial^2 \log L/\partial\theta^2)$, that is, omitting the expectation
part. Since the derivative of a function at a point is the slope of
the tangent to that function at that point, $(-\partial^2 \log L/\partial\theta^2)$ is the slope of
the score function. The question is: At what point is this slope calculated?
For the W test, this slope is calculated at the point given by the ML
estimate $\hat{\theta}$. For the LM test, it is calculated at the point $\theta = \theta_0$ specified
by the null hypothesis.

We can now show the relationship between the LR, W, and the LM tests geometrically.¹⁸
Consider testing the null hypothesis $H_0$: $\theta = \theta_0$.

¹⁷ There is a lot of literature on the W, LR and LM tests. For a survey, see R. F. Engle, "Wald,
Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics," in Z. Griliches and M. D.
Intriligator (eds.), Handbook of Econometrics, Vol. 2 (North Holland Publishing Co., 1984).
¹⁸ This geometric interpretation is from A. R. Pagan, "Reflections on Australian Macro Modelling,"
Working Paper, Australian National University, September 1981.

Figure 3.9 shows the graph of the score function $S(\theta) = \partial \log L/\partial\theta$. The ML
estimator $\hat{\theta}$ is the value of $\theta$ for which $S(\theta) = 0$. This is shown as the point $C$,
where $S(\theta)$ crosses the $\theta$-axis. Since

$$\int S(\theta)\,d\theta = \log L(\theta)$$

we have $\int^{\hat{\theta}} S(\theta)\,d\theta = \log L(\hat{\theta})$ and $\int^{\theta_0} S(\theta)\,d\theta = \log L(\theta_0)$. The LR statistic
is $\text{LR} = -2[\log L(\theta_0) - \log L(\hat{\theta})] = 2[\log L(\hat{\theta}) - \log L(\theta_0)]$. Hence we get

$$\text{LR} = 2 \times (\text{area } BAC)$$

Now consider approximating area $BAC$ by a triangle. Draw a tangent to $S(\theta)$ at
point $C$ (i.e., $\theta = \hat{\theta}$). As mentioned earlier, $I(\hat{\theta})$ is estimated by the slope of this
tangent, that is, by $AD/AC$.

$$\text{var}(\hat{\theta}) = \frac{1}{I(\hat{\theta})} = \frac{AC}{AD}$$

Figure 3.9. Graph of the score function and geometric derivation of the LR, W, and
LM test statistics. LR/2 = area BAC, W/2 = area DAC, LM/2 = area BAE.

The Wald statistic is $W = (\hat{\theta} - \theta_0)^2/\text{var}(\hat{\theta}) = AC^2/(AC/AD) = AC \times AD$.
Hence we get

$$W = 2 \times (\text{area of the triangle } ADC)$$

Now draw a tangent to $S(\theta)$ at point $B$ (i.e., $\theta = \theta_0$). By a similar argument, we
get

$$\text{LM} = 2 \times (\text{area of the triangle } BAE)$$

Thus the W and LM statistics amount to using two different triangle approximations
to area $BAC$ under the curve $S(\theta)$. They differ in computing the value
of $I(\theta)$. The W statistic computes its value at $\theta = \hat{\theta}$, the ML estimate. The LM
statistic computes it at $\theta = \theta_0$ and is easier to calculate because we have to
estimate the model with fewer parameters (the restricted model with $\theta = \theta_0$).
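
The score and information quantities can be made concrete with a one-parameter likelihood. The sketch below is an assumption-laden illustration that is not in the text: it uses a Bernoulli sample (hypothetical counts, 32 successes in 50 trials), a hypothetical null value $\theta_0 = 0.5$, and the expected information $I(\theta) = n/[\theta(1 - \theta)]$ rather than the observed second derivative.

```python
import math

# Hypothetical Bernoulli sample: k successes in n trials (not data from the text).
n, k = 50, 32
theta0 = 0.5          # value specified by the null hypothesis (an assumption)
theta_hat = k / n     # ML estimate, the root of the score function


def log_l(theta):
    """Log-likelihood of the Bernoulli sample, up to an additive constant."""
    return k * math.log(theta) + (n - k) * math.log(1.0 - theta)


def score(theta):
    """Score function S(theta) = d log L / d theta."""
    return k / theta - (n - k) / (1.0 - theta)


def information(theta):
    """Expected information I(theta) for n Bernoulli trials."""
    return n / (theta * (1.0 - theta))


# LR uses both the restricted and unrestricted likelihood values.
lr = 2.0 * (log_l(theta_hat) - log_l(theta0))
# W evaluates the information at the ML estimate.
w = (theta_hat - theta0) ** 2 * information(theta_hat)
# LM evaluates the score and the information at the null value.
lm = score(theta0) ** 2 / information(theta0)

print("score at theta_hat =", score(theta_hat))
print("LR =", lr, " W =", w, " LM =", lm)
```

The LM statistic needs only quantities evaluated at $\theta_0$, while W needs only quantities at $\hat{\theta}$, which mirrors the two tangent constructions in the figure. The ordering of the three values in this example is a property of these particular numbers, not a general result for nonlinear models.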

(8) Methods of Solving the Likelihood Equations 

Earlier (see item 4 in the appendix) we discussed the maximum likelihood (ML)
estimation of the parameters in the linear regression model. There we saw that
the ML estimates of $\alpha$ and $\beta$ were the same as the least squares estimates.

In many cases the computation of the ML estimates is not as straightforward
and we need some iterative techniques to compute them. Here we shall explain
two of the iteration methods commonly used: the Newton-Raphson method
and the scoring method. For ease of exposition we shall consider the case of a
single parameter $\theta$. The generalization to several parameters involves writing
down the corresponding expressions in matrix notation (using the Appendix to
Chapter 2).

Consider a sample of $n$ independent observations $(y_1, y_2, \ldots, y_n)$ from a
density function $f(y, \theta)$. Let $L(\theta)$ be the likelihood function. Then

$$\log L(\theta) = \sum_{i=1}^{n} \log f(y_i, \theta)$$

A necessary condition that $\log L(\theta)$ is maximum at $\theta = \hat{\theta}$ is

$$\left. \frac{\partial \log L}{\partial\theta} \right|_{\theta = \hat{\theta}} = 0$$

The equation $\partial \log L/\partial\theta = 0$ is called the likelihood equation. Sometimes it can
be solved easily. But often we have to use iterative methods.

Let $\theta_0$ be a trial value of the estimate. Then expanding $\partial \log L/\partial\theta$ and retaining
only the first power of $\delta\theta = \hat{\theta} - \theta_0$ we get

$$\frac{\partial \log L}{\partial\theta} = \frac{\partial \log L}{\partial\theta_0} + \delta\theta\,\frac{\partial^2 \log L}{\partial\theta_0^2}$$

At the maximum $\partial \log L/\partial\theta = 0$. This gives us $\delta\theta$. Thus, the correction for the
next iteration is given by

$$\delta\theta = \left( -\frac{\partial^2 \log L}{\partial\theta_0^2} \right)^{-1} \frac{\partial \log L}{\partial\theta_0} \quad \text{and} \quad \theta_1 = \theta_0 + \delta\theta$$

We now proceed with the first and second derivatives at $\theta_1$ and get a new estimate
$\theta_2$. We continue this iterative procedure until convergence (we should also
check the value of $\log L$ at each iteration). This is the Newton-Raphson
method. The quantity $\partial \log L/\partial\theta_0$ is sometimes called the score at $\theta_0$ and is
denoted by $S(\theta_0)$. Also $E[-\partial^2 \log L/\partial\theta^2]$ is called the information on $\theta$ in the
sample and is denoted by $I(\theta)$. In the method of scoring we substitute $I(\theta_0)$ for
$(-\partial^2 \log L/\partial\theta_0^2)$ in the Newton-Raphson method. The iteration in the
method of scoring gives

$$\theta_1 = \theta_0 + \frac{S(\theta_0)}{I(\theta_0)}$$

The difference between the Newton-Raphson method and the method of scoring
is that the former depends on observed second derivatives and the latter
depends on expected values of the second derivatives.

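As an illustration of the scoring method, consider the following regression
model.

$$y_i = \frac{1}{x_i - \beta} + u_i$$

where $u_i$ are $IN(0, 1)$. We assume the variance of $u_i$ to be unity, since we want
to discuss the single-parameter case here. The log likelihood function is

$$\log L = \text{const} - \frac{1}{2} \sum \left( y_i - \frac{1}{x_i - \beta} \right)^2$$

$$\frac{\partial \log L}{\partial\beta} = \sum \left( y_i - \frac{1}{x_i - \beta} \right) \frac{1}{(x_i - \beta)^2} = 0$$

is the likelihood equation. This is a nonlinear equation in $\beta$.

$$\frac{\partial^2 \log L}{\partial\beta^2} = 2\sum \frac{y_i}{(x_i - \beta)^3} - 3\sum \frac{1}{(x_i - \beta)^4}$$

Since

$$E(y_i) = \frac{1}{x_i - \beta}$$

we have

$$I(\beta) = E\left( -\frac{\partial^2 \log L}{\partial\beta^2} \right) = \sum \frac{1}{(x_i - \beta)^4}$$

Suppose we start with an initial value $\beta_0$. Then we calculate $S(\beta_0)$ and $I(\beta_0)$.
The next value is given by

$$\beta_1 = \beta_0 + \frac{S(\beta_0)}{I(\beta_0)}$$

We next calculate $S(\beta_1)$ and $I(\beta_1)$ and proceed with this equation until convergence.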
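
The scoring iteration for this model takes only a few lines. The sketch below is an illustration under stated assumptions: the $x$ values, the true value $\beta = 0.5$, the starting value, and the stopping rule are all hypothetical, and the data are generated without noise so that the iteration should recover the true $\beta$.

```python
def score(beta, x, y):
    """S(beta) = sum (y_i - 1/(x_i - beta)) / (x_i - beta)^2."""
    return sum((yi - 1.0 / (xi - beta)) / (xi - beta) ** 2 for xi, yi in zip(x, y))


def information(beta, x):
    """I(beta) = sum 1/(x_i - beta)^4, the expected negative second derivative."""
    return sum(1.0 / (xi - beta) ** 4 for xi in x)


def scoring(beta0, x, y, tol=1e-12, max_iter=100):
    """Iterate beta <- beta + S(beta)/I(beta) until the step is below tol."""
    beta = beta0
    for _ in range(max_iter):
        step = score(beta, x, y) / information(beta, x)
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical noise-free data generated with a true beta of 0.5.
x = [2.0, 3.0, 4.0, 5.0, 6.0]
true_beta = 0.5
y = [1.0 / (xi - true_beta) for xi in x]

beta_hat = scoring(0.0, x, y)
asymptotic_variance = 1.0 / information(beta_hat, x)
print("beta_hat =", beta_hat, " estimated asymptotic variance =", asymptotic_variance)
```

A Newton-Raphson version would replace `information` by the observed negative second derivative, which also involves the $y_i$. In practice the starting value must keep every $x_i - \beta$ away from zero.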





The expression $I(\beta)$ plays a central role in the theory of ML estimation. It
has been proved that, under fairly general conditions, the ML estimator is consistent
and asymptotically normally distributed with variance $[I(\beta)]^{-1}$. Thus,
in the above problem at the final step of the iteration we get an estimate of the
asymptotic variance of the ML estimator as $[I(\hat{\beta})]^{-1}$.

The quantity $[I(\beta)]^{-1}$ is also known as the information limit to the variance,
or alternatively as the Cramer-Rao lower bound for the variance of the estimator
$\hat{\beta}$.

(9) Stochastic Regressors 
Consider the regression model

$$y_i = \beta x_i + u_i \qquad i = 1, 2, \ldots, n$$

$u_i$ are independent with mean zero and common variance $\sigma^2$. $x_i$ are random
variables but $x_i$ and $u_j$ are independent for all $i$ and $j$. We will show that the
least squares estimator $\hat{\beta}$ is an unbiased estimator for $\beta$. We have

$$\hat{\beta} = \sum c_i y_i$$

where

$$c_i = \frac{x_i}{\sum x_i^2}$$

Substituting the value of $y_i$ and simplifying, we get

$$\hat{\beta} = \beta + \sum c_i u_i$$

Since $x_i$ and $u_j$ are independent for all $i$ and $j$, we know that $c_i$ and $u_i$ are independent.
Thus $E(c_i u_i) = E(c_i) \cdot E(u_i) = 0$. Since $E(u_i) = 0$, we need not worry
about evaluating $E(c_i)$. Hence we get the result that

$$E(\hat{\beta}) = \beta + E\left( \sum c_i u_i \right) = \beta$$

Thus the least squares estimator is unbiased even if $x_i$ are random variables,
provided that they are independent of the error terms.
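
The unbiasedness result can be illustrated with a small simulation. The sketch below is only suggestive, since a simulation cannot prove unbiasedness; the true $\beta$, the sample size, the number of replications, the seed, and the choice of standard normal $x_i$ and $u_i$ are all assumptions made for the example.

```python
import random

random.seed(12345)  # fixed seed so that the run is reproducible

beta = 2.0    # hypothetical true coefficient
n = 20        # observations per sample
reps = 2000   # number of simulated samples

estimates = []
for _ in range(reps):
    # Regressors and errors are drawn independently of each other.
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    u = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + ui for xi, ui in zip(x, u)]
    # Least squares estimator for the model without a constant term.
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    estimates.append(b)

mean_estimate = sum(estimates) / reps
print("average of the least squares estimates =", mean_estimate)
```

The average of the estimates should lie close to the assumed $\beta$, even though the regressors differ from sample to sample.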

However, when we consider the variance of $\hat{\beta}$, we have to evaluate $E(c_i^2) \cdot E(u_i^2)$.
The latter term is $\sigma^2$, but the former term is rather cumbersome to compute
in general; hence the conclusions stated in Section 3.11.

4.1 Introduction


In simple regression we study the relationship between an explained variable $y$
and an explanatory variable $x$. In multiple regression we study the relationship
between $y$ and a number of explanatory variables $x_1, x_2, \ldots, x_k$. For instance,
in demand studies we study the relationship between quantity demanded of a
good and price of the good, prices of substitute goods and the consumer's income.
The model we assume is

$$y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + u_i \qquad i = 1, 2, \ldots, n$$

The errors $u_i$ are again due to measurement errors in $y$ and errors in the
specification of the relationship between $y$ and the $x$'s. We make the same assumptions
about $u_i$ that we made in Chapter 3. These are:

1. $E(u_i) = 0$.

2. $V(u_i) = \sigma^2$ for all $i$.

3. $u_i$ and $u_j$ are independent for all $i \neq j$.

4. $u_i$ and $x_j$ are independent for all $i$ and $j$.

5. $u_i$ are normally distributed for all $i$.

Under the first four assumptions, we can show that the method of least
squares gives estimators $\hat{\alpha}, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ that are unbiased and have minimum
variance among the class of linear unbiased estimators. (The proofs are
similar to those in Chapter 3 and are given in the appendix.)

Assumption 5 is needed for tests of significance and confidence intervals. It
is not needed to prove the optimal properties of least squares estimators.

In addition to these assumptions, which are similar to those we make in the
case of simple regression, we will also assume that $x_1, x_2, \ldots, x_k$ are not collinear,
that is, there is no deterministic linear relationship among them. For
instance, suppose that we have the regression equation

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$$

but $x_1$ and $x_2$ are connected by the deterministic linear relationship

$$2x_1 + x_2 = 4$$

Then we can express $x_2$ in terms of $x_1$ and get $x_2 = 4 - 2x_1$, and the regression
equation becomes

$$y = \alpha + \beta_1 x_1 + \beta_2(4 - 2x_1) + u = (\alpha + 4\beta_2) + (\beta_1 - 2\beta_2)x_1 + u$$

Thus we can estimate $(\alpha + 4\beta_2)$ and $(\beta_1 - 2\beta_2)$ but not $\alpha$, $\beta_1$, $\beta_2$ separately.
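
The consequence of the exact relationship $2x_1 + x_2 = 4$ can be seen numerically. The sketch below uses hypothetical values of $x_1$ and computes the sums of squares and cross products of the two regressors in deviation form; the names `s11`, `s22`, `s12` and `delta` are illustrative. When the determinant `delta` is zero, a pair of linear equations in the two slope coefficients built from these sums has no unique solution.

```python
import math

# Hypothetical values of x1; x2 is tied to x1 by 2*x1 + x2 = 4.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [4.0 - 2.0 * v for v in x1]
n = len(x1)

x1_bar = sum(x1) / n
x2_bar = sum(x2) / n

# Sums of squares and cross products in deviations from the means.
s11 = sum((a - x1_bar) ** 2 for a in x1)
s22 = sum((b - x2_bar) ** 2 for b in x2)
s12 = sum((a - x1_bar) * (b - x2_bar) for a, b in zip(x1, x2))

# Determinant of the 2x2 system in the slope coefficients.
delta = s11 * s22 - s12 ** 2
# Correlation between the two explanatory variables.
r12 = s12 / math.sqrt(s11 * s22)

print("S11 =", s11, " S22 =", s22, " S12 =", s12)
print("determinant =", delta, " correlation =", r12)
```

The determinant is zero and the correlation is $-1$, so only the combinations $(\alpha + 4\beta_2)$ and $(\beta_1 - 2\beta_2)$ are determined by the data.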

A case where there is an exact linear relationship between the explanatory
variables is known as exact or perfect collinearity. In the case of the two variables
we considered, the exact relationship implies that the correlation coefficient
between $x_1$ and $x_2$ is $+1$ or $-1$. In our analysis in this chapter we rule out
perfect collinearity but not the case where the correlation between the variables
is high but not perfect. When it comes to the analysis of the effect of
several variables $x_1, x_2, \ldots, x_k$ on $y$, we have to distinguish between joint
effects and partial effects. For instance, suppose that we are estimating the
effect of price and income on quantity demanded; then we have to consider the
joint effect of income and price and the partial effects:

1. Effect of price on quantity demanded holding income constant.

2. Effect of income on quantity demanded holding price constant.

These are the problems that we will be dealing with in this chapter. If price
and income are highly correlated with each other, it is intuitively clear that it
would be difficult to disentangle the separate effects of the two variables.

We start our analysis with the case of two explanatory variables and then
present the formulas for the case of $k$ explanatory variables.


4.2 A Model with Two 
Explanatory Variables 

Consider the model 

y, = a + 0i* w + 02*2. + u t i = 1, 2,. . . , n (4.1) 

The assumptions we have made about the error term u imply that 

E ( u ) = 0 cov(jc,, u ) = 0 cov(jc 2 , «) = 0 

As in the case of the simple regression model discussed in Section 3.3, we can 
replace these assumptions by their sample counterparts. 

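Let \(\hat{\alpha}\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) be the estimators of \(\alpha\), \(\beta_1\), and \(\beta_2\), respectively. The sample
counterpart of \(u_i\) is the residual

\[
\hat{u}_i = y_i - \hat{\alpha} - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i}
\]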

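The three equations to determine \(\hat{\alpha}\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) are obtained by replacing the
population assumptions by their sample counterparts:

Population Assumption          Sample Counterpart

\(E(u) = 0\)                   \((1/n)\sum \hat{u}_i = 0\)  or  \(\sum \hat{u}_i = 0\)

\(\operatorname{cov}(u, x_1) = 0\)   \((1/n)\sum x_{1i}\hat{u}_i = 0\)  or  \(\sum x_{1i}\hat{u}_i = 0\)

\(\operatorname{cov}(u, x_2) = 0\)   \((1/n)\sum x_{2i}\hat{u}_i = 0\)  or  \(\sum x_{2i}\hat{u}_i = 0\)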


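These equations can also be obtained by the use of the least squares method
and are referred to as the “normal equations.”

The Least Squares Method

The least squares method says that we should choose the estimators \(\hat{\alpha}\), \(\hat{\beta}_1\), \(\hat{\beta}_2\)
of \(\alpha\), \(\beta_1\), \(\beta_2\) so as to minimize

\[
Q = \sum (y_i - \hat{\alpha} - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i})^2
\]

Differentiate \(Q\) with respect to \(\hat{\alpha}\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) and equate the derivatives to zero.
We get

\[
\frac{\partial Q}{\partial \hat{\alpha}} = 0 \;\Rightarrow\; 2\sum (y_i - \hat{\alpha} - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i})(-1) = 0 \tag{4.2}
\]

\[
\frac{\partial Q}{\partial \hat{\beta}_1} = 0 \;\Rightarrow\; 2\sum (y_i - \hat{\alpha} - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i})(-x_{1i}) = 0 \tag{4.3}
\]

\[
\frac{\partial Q}{\partial \hat{\beta}_2} = 0 \;\Rightarrow\; 2\sum (y_i - \hat{\alpha} - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i})(-x_{2i}) = 0 \tag{4.4}
\]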

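These three equations, as mentioned earlier, are called the “normal
equations.” They can be simplified as follows. Equation (4.2) can be written as

\[
\sum y_i = n\hat{\alpha} + \hat{\beta}_1 \sum x_{1i} + \hat{\beta}_2 \sum x_{2i}
\]

or

\[
\bar{y} = \hat{\alpha} + \hat{\beta}_1 \bar{x}_1 + \hat{\beta}_2 \bar{x}_2 \tag{4.5}
\]

where

\[
\bar{y} = \frac{1}{n}\sum y_i \qquad \bar{x}_1 = \frac{1}{n}\sum x_{1i} \qquad \bar{x}_2 = \frac{1}{n}\sum x_{2i}
\]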

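Equation (4.3) can be written as

\[
\sum x_{1i} y_i = \hat{\alpha} \sum x_{1i} + \hat{\beta}_1 \sum x_{1i}^2 + \hat{\beta}_2 \sum x_{1i} x_{2i}
\]

Substituting the value of \(\hat{\alpha}\) from (4.5) into this equation, we get

\[
\sum x_{1i} y_i = n\bar{x}_1(\bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2) + \hat{\beta}_1 \sum x_{1i}^2 + \hat{\beta}_2 \sum x_{1i} x_{2i} \tag{4.6}
\]

We can simplify this equation by the use of the following notation. Let us define

\[
S_{11} = \sum x_{1i}^2 - n\bar{x}_1^2 \qquad S_{1y} = \sum x_{1i} y_i - n\bar{x}_1 \bar{y}
\]

\[
S_{12} = \sum x_{1i} x_{2i} - n\bar{x}_1 \bar{x}_2 \qquad S_{2y} = \sum x_{2i} y_i - n\bar{x}_2 \bar{y}
\]

\[
S_{22} = \sum x_{2i}^2 - n\bar{x}_2^2 \qquad S_{yy} = \sum y_i^2 - n\bar{y}^2
\]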

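Equation (4.6) can be written as

\[
S_{1y} = \hat{\beta}_1 S_{11} + \hat{\beta}_2 S_{12} \tag{4.7}
\]

By a similar simplification, equation (4.4) can be written as

\[
S_{2y} = \hat{\beta}_1 S_{12} + \hat{\beta}_2 S_{22} \tag{4.8}
\]

Now we can solve these two equations to get \(\hat{\beta}_1\) and \(\hat{\beta}_2\). We get

\[
\hat{\beta}_1 = \frac{S_{22} S_{1y} - S_{12} S_{2y}}{\Delta}
\qquad
\hat{\beta}_2 = \frac{S_{11} S_{2y} - S_{12} S_{1y}}{\Delta} \tag{4.9}
\]

where \(\Delta = S_{11} S_{22} - S_{12}^2\). Once we obtain \(\hat{\beta}_1\) and \(\hat{\beta}_2\) we can get \(\hat{\alpha}\) from equation
(4.5). We have

\[
\hat{\alpha} = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2
\]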

Thus the computational procedure is as follows: 

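1. Obtain all the means: \(\bar{y}\), \(\bar{x}_1\), \(\bar{x}_2\).

2. Obtain all the sums of squares and sums of products:
\(\sum x_{1i}^2\), \(\sum x_{1i} x_{2i}\), and so on.

3. Obtain \(S_{11}\), \(S_{12}\), \(S_{22}\), \(S_{1y}\), \(S_{2y}\), and \(S_{yy}\).

4. Solve equations (4.7) and (4.8) to get \(\hat{\beta}_1\) and \(\hat{\beta}_2\).

5. Substitute these in (4.5) to get \(\hat{\alpha}\).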
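
The five steps above can be sketched in a few lines of Python. This is only an illustration: the function name `fit_two_regressors` and the variable names are assumptions made here, and the data are the five observations of Table 4.1 from the illustrative example in this section.

```python
def fit_two_regressors(y, x1, x2):
    """Least squares estimates for y = a + b1*x1 + b2*x2 + u, following steps 1-5."""
    n = len(y)

    # Step 1: means.
    y_bar, x1_bar, x2_bar = sum(y) / n, sum(x1) / n, sum(x2) / n

    # Steps 2-3: raw sums of products, corrected for the means (S11, S12, ...).
    def s(u, u_bar, v, v_bar):
        return sum(ui * vi for ui, vi in zip(u, v)) - n * u_bar * v_bar

    s11 = s(x1, x1_bar, x1, x1_bar)
    s12 = s(x1, x1_bar, x2, x2_bar)
    s22 = s(x2, x2_bar, x2, x2_bar)
    s1y = s(x1, x1_bar, y, y_bar)
    s2y = s(x2, x2_bar, y, y_bar)

    # Step 4: solve the two normal equations (4.7) and (4.8).
    delta = s11 * s22 - s12 ** 2
    if delta == 0:
        raise ValueError("x1 and x2 are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / delta
    b2 = (s11 * s2y - s12 * s1y) / delta

    # Step 5: intercept from (4.5).
    a = y_bar - b1 * x1_bar - b2 * x2_bar
    return a, b1, b2


# Table 4.1: salary (thousands of dollars), education, experience.
salary = [30, 20, 36, 24, 40]
education = [4, 3, 6, 4, 8]
experience = [10, 8, 11, 9, 12]

a, b1, b2 = fit_two_regressors(salary, education, experience)
print(a, b1, b2)  # -23.75 -0.25 5.5
```

The check on `delta` mirrors the exclusion of perfect collinearity: when \(S_{11}S_{22} = S_{12}^2\) the two normal equations cannot be solved separately for the two slopes.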


In the case of simple regression we also defined the following:

\[
\text{residual sum of squares} = S_{yy} - \hat{\beta} S_{xy}
\qquad
\text{regression sum of squares} = \hat{\beta} S_{xy}
\]

The analogous expressions in multiple regression are

\[
\text{RSS} = S_{yy} - \hat{\beta}_1 S_{1y} - \hat{\beta}_2 S_{2y}
\]

\[
\text{regression sum of squares} = \hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}
\]

\[
R^2_{y \cdot 12} = \frac{\hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}}{S_{yy}}
\]

\(R^2_{y \cdot 12}\) is called the coefficient of multiple determination and its positive square
root is called the multiple correlation coefficient. The first subscript is the
explained variable. The subscripts after the dot are the explanatory variables. To
avoid cumbersome notation we have written 12 instead of \(x_1 x_2\). Since it is only
the \(x\)'s that have subscripts, there is no confusion in this notation.

The procedure in the case of three explanatory variables is analogous. The
normal equations give

\[
\hat{\alpha} = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2 - \hat{\beta}_3 \bar{x}_3
\]

and

\[
S_{1y} = \hat{\beta}_1 S_{11} + \hat{\beta}_2 S_{12} + \hat{\beta}_3 S_{13}
\]
\[
S_{2y} = \hat{\beta}_1 S_{12} + \hat{\beta}_2 S_{22} + \hat{\beta}_3 S_{23}
\]
\[
S_{3y} = \hat{\beta}_1 S_{13} + \hat{\beta}_2 S_{23} + \hat{\beta}_3 S_{33}
\]

We can solve these equations by successive elimination. But this is
cumbersome. There are computer programs now available that compute all these
things once the basic data are fed in. Thus we will concentrate more on the
interpretation of the coefficients rather than on solving the normal equations.
Again,

\[
\text{RSS} = S_{yy} - \hat{\beta}_1 S_{1y} - \hat{\beta}_2 S_{2y} - \hat{\beta}_3 S_{3y}
\]

and

\[
R^2_{y \cdot 123} = \frac{\hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y} + \hat{\beta}_3 S_{3y}}{S_{yy}}
\]

Note that \(\text{RSS} = S_{yy}(1 - R^2)\) in all cases.

One other important relationship to note, which we have also mentioned in
Chapter 3, is that if we consider the residual \(\hat{u}_i\), we note that

\[
\hat{u}_i = y_i - \hat{\alpha} - \hat{\beta}_1 x_{1i} - \hat{\beta}_2 x_{2i}
\]

Thus the normal equations (4.2)-(4.4) imply that

\[
\sum \hat{u}_i = 0 \qquad \sum \hat{u}_i x_{1i} = 0 \qquad \sum \hat{u}_i x_{2i} = 0 \tag{4.10}
\]

These equations imply that \(\operatorname{cov}(\hat{u}, x_1) = 0\) and \(\operatorname{cov}(\hat{u}, x_2) = 0\). Thus the sum of
the residuals is equal to zero and the residuals are uncorrelated with both \(x_1\)
and \(x_2\).
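
The three conditions in (4.10) are easy to verify numerically. The sketch below is an illustration only: it uses the five observations of Table 4.1 and the estimates reported in the illustrative example of this section (\(\hat{\alpha} = -23.75\), \(\hat{\beta}_1 = -0.25\), \(\hat{\beta}_2 = 5.5\)); the variable names are assumptions made here.

```python
# Data from Table 4.1 and the estimates reported in the illustrative example.
salary = [30, 20, 36, 24, 40]
education = [4, 3, 6, 4, 8]
experience = [10, 8, 11, 9, 12]
a_hat, b1_hat, b2_hat = -23.75, -0.25, 5.5

# Residuals u_i = y_i - a - b1*x1_i - b2*x2_i.
residuals = [
    y - a_hat - b1_hat * x1 - b2_hat * x2
    for y, x1, x2 in zip(salary, education, experience)
]

# The three sums that the normal equations force to zero.
sum_u = sum(residuals)
sum_u_x1 = sum(u * x1 for u, x1 in zip(residuals, education))
sum_u_x2 = sum(u * x2 for u, x2 in zip(residuals, experience))

# Residual sum of squares, for comparison with S_yy(1 - R^2).
rss = sum(u * u for u in residuals)

print(residuals)
print(sum_u, sum_u_x1, sum_u_x2, rss)
```

If any of the three sums were not zero (up to rounding), the coefficients would not be the least squares estimates for these data.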


Illustrative Example 

With many computer packages readily available for multiple regression analysis
one does not need to go through the details of the calculations involved.
However, it is instructive to see what computations are done by the computer
programs. The following simple example illustrates the computations.

In Table 4.1 we present data on a sample of five persons randomly drawn
from a large firm giving their annual salaries, years of education, and years of
experience with the firm they are working for.

\(Y\) = annual salary (thousands of dollars)

\(X_1\) = years of education past high school

\(X_2\) = years of experience with the firm

Table 4.1 Data on Salaries, Years of Education, and Years of Experience

  \(Y\)   \(X_1\)   \(X_2\)   \(Y - \bar{Y}\)   \(X_1 - \bar{X}_1\)   \(X_2 - \bar{X}_2\)
  30     4     10       0        -1         0
  20     3      8     -10        -2        -2
  36     6     11       6         1         1
  24     4      9      -6        -1        -1
  40     8     12      10         3         2

The means are \(\bar{Y} = 30\), \(\bar{X}_1 = 5\), \(\bar{X}_2 = 10\). The sums of squares and sums of products of deviations
from the respective means are

\[
S_{11} = 16 \qquad S_{12} = 12 \qquad S_{1y} = 62
\]
\[
S_{22} = 10 \qquad S_{2y} = 52
\]
\[
S_{yy} = 272
\]

The normal equations are

\[
16\hat{\beta}_1 + 12\hat{\beta}_2 = 62
\]
\[
12\hat{\beta}_1 + 10\hat{\beta}_2 = 52
\]

Solving these equations, we get

\[
\hat{\beta}_1 = -0.25 \qquad \hat{\beta}_2 = 5.5
\]

\[
\hat{\alpha} = \bar{Y} - \hat{\beta}_1 \bar{X}_1 - \hat{\beta}_2 \bar{X}_2 = 30 - (-1.25) - 55 = -23.75
\]

\[
R^2 = \frac{\hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}}{S_{yy}} = \frac{270.5}{272} = 0.9945
\]

Thus the regression equation is

\[
\hat{Y} = -23.75 - 0.25X_1 + 5.5X_2 \qquad R^2 = 0.9945
\]

This equation suggests that years of experience with the firm is far more
important than years of education (which actually has a negative sign). The
equation says that we can predict that one more year of experience, after allowing
for years of education (or holding it constant), results in an annual increase in
salary of $5500. That is, if we consider persons with the same level of
education, the one with one more year of experience can be expected to have a
salary that is $5500 higher. Similarly, if we consider two people with the same
experience, the one with one more year of education can be expected to have
an annual salary that is $250 lower. Of course, all these numbers are subject to some
uncertainty, which we will be discussing in Section 4.3. It will then be clear
that we should be dropping the variable \(X_1\) completely.

What about the interpretation of the constant term -23.75? Clearly, that is
the salary one would get with no experience and only high school education.
But a negative salary is not possible. What of the case when \(X_2 = 0\), that is, a
person just joined the firm? Again, the equation predicts a negative salary! So
what is wrong?

What we have to conclude is that the sample we have is not a truly
representative sample from all the people working in the firm. The sample must have
been drawn from a subgroup. We have persons with experience ranging from 8
to 12 years in the firm. So we cannot extrapolate the results too far out of this
sample range. We cannot use the equation to predict what a new entrant would
earn.

It would be interesting to see what the simple regressions in this example
give us. We get

\[
\hat{Y} = 10.625 + 3.875X_1 \qquad R^2 = 0.883
\]
\[
\hat{Y} = -22.0 + 5.2X_2 \qquad R^2 = 0.994
\]

The simple regression equation predicts that an increase of one year of
education results in an increase of $3875 in annual salary. However, after allowing
for the effect of years of experience we find from the multiple regression
equation that it does not result in any increase in salary. Thus omission of the
variable “years of experience” gives us wrong conclusions about the effect of years
of education on salaries.
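
The contrast between the simple and the multiple regression can be reproduced from the sums of squares and products alone. The following sketch is illustrative (function and variable names are assumptions made here). It also checks an identity that follows from dividing the normal equation (4.7) by \(S_{11}\): the simple regression slope of \(Y\) on \(X_1\) equals \(\hat{\beta}_1 + \hat{\beta}_2 (S_{12}/S_{11})\), so the omitted variable's effect is loaded onto education.

```python
# Sums of squares and products, and means, from the illustrative example (Table 4.1).
s11, s12, s22 = 16.0, 12.0, 10.0
s1y, s2y, syy = 62.0, 52.0, 272.0
y_bar, x1_bar, x2_bar = 30.0, 5.0, 10.0


def simple_regression(sxy, sxx, x_bar):
    """Intercept, slope and R^2 of a regression of Y on a single X."""
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r2 = sxy ** 2 / (sxx * syy)
    return intercept, slope, r2


a1, slope_educ, r2_educ = simple_regression(s1y, s11, x1_bar)
a2, slope_exper, r2_exper = simple_regression(s2y, s22, x2_bar)

# Multiple regression slopes from (4.9).
delta = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / delta
b2 = (s11 * s2y - s12 * s1y) / delta

# Identity implied by (4.7): S1y/S11 = b1 + b2 * (S12/S11).
implied_simple_slope = b1 + b2 * s12 / s11

print(a1, slope_educ, round(r2_educ, 3))    # 10.625 3.875 0.883
print(a2, slope_exper, round(r2_exper, 3))  # -22.0 5.2 0.994
print(b1, b2, implied_simple_slope)         # -0.25 5.5 3.875
```

Here \(S_{12}/S_{11} = 0.75\) is the slope from regressing experience on education, so the simple regression credits education with \(5.5 \times 0.75\) thousand dollars that really belongs to experience.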


4.3 Statistical Inference in the Multiple 
Regression Model 


Again we will consider the results for a model with two explanatory variables
first and then the general model. If we assume that the errors \(u_i\) are normally
distributed, this, together with the other assumptions we have made, implies
that \(u_i \sim \text{IN}(0, \sigma^2)\), and the following results can be derived. (Proofs are similar to
those in Chapter 3 and are omitted.)


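1. \(\hat{\alpha}\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) have normal distributions with means \(\alpha\), \(\beta_1\), \(\beta_2\),
respectively.

2. If we denote the correlation coefficient between \(x_1\) and \(x_2\) by \(r_{12}\), then

\[
\operatorname{var}(\hat{\beta}_1) = \frac{\sigma^2}{S_{11}(1 - r_{12}^2)}
\qquad
\operatorname{var}(\hat{\beta}_2) = \frac{\sigma^2}{S_{22}(1 - r_{12}^2)}
\]

\[
\operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) = \frac{-\sigma^2 r_{12}^2}{S_{12}(1 - r_{12}^2)}
\]

\[
\operatorname{var}(\hat{\alpha}) = \frac{\sigma^2}{n} + \bar{x}_1^2 \operatorname{var}(\hat{\beta}_1) + 2\bar{x}_1 \bar{x}_2 \operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) + \bar{x}_2^2 \operatorname{var}(\hat{\beta}_2)
\]

\[
\operatorname{cov}(\hat{\alpha}, \hat{\beta}_1) = -[\bar{x}_1 \operatorname{var}(\hat{\beta}_1) + \bar{x}_2 \operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2)]
\]

\[
\operatorname{cov}(\hat{\alpha}, \hat{\beta}_2) = -[\bar{x}_1 \operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) + \bar{x}_2 \operatorname{var}(\hat{\beta}_2)]
\]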


Comments

1. Note that the higher the value of \(r_{12}^2\) (other things staying the same), the
higher the variances of \(\hat{\beta}_1\) and \(\hat{\beta}_2\). If \(r_{12}^2\) is very high, we cannot estimate
\(\beta_1\) and \(\beta_2\) with much precision.

2. Note that \(S_{11}(1 - r_{12}^2)\) is the residual sum of squares from a regression of
\(x_1\) on \(x_2\). Similarly, \(S_{22}(1 - r_{12}^2)\) is the residual sum of squares from a
regression of \(x_2\) on \(x_1\). We can now see the analogy with the expression
for the variance in the case of simple regression: \(\operatorname{var}(\hat{\beta}_i) = \sigma^2/\text{RSS}_i\),
where \(\text{RSS}_i\) is the residual sum of squares after regressing \(x_i\) on the other
variable, that is, after removing the effect of the other variable. This
result generalizes to the case of several explanatory variables. In that case
\(\text{RSS}_i\) is the residual sum of squares from a regression of \(x_i\) on all the other
\(x\)'s.
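
Both comments can be checked numerically. The sketch below (illustrative only; names are assumptions made here) uses the two explanatory variables of Table 4.1 to confirm that \(S_{11}(1 - r_{12}^2)\) equals the residual sum of squares from regressing \(x_1\) on \(x_2\), and then tabulates how the factor \(1/[S_{11}(1 - r_{12}^2)]\) multiplying \(\sigma^2\) grows as \(r_{12}^2\) approaches 1.

```python
# Explanatory variables from Table 4.1.
x1 = [4, 3, 6, 4, 8]
x2 = [10, 8, 11, 9, 12]
n = len(x1)

# Deviations from the means.
m1, m2 = sum(x1) / n, sum(x2) / n
d1 = [v - m1 for v in x1]
d2 = [v - m2 for v in x2]

s11 = sum(a * a for a in d1)
s22 = sum(b * b for b in d2)
s12 = sum(a * b for a, b in zip(d1, d2))
r12_sq = s12 ** 2 / (s11 * s22)

# Auxiliary regression of x1 on x2: slope b12 and residual sum of squares RSS_1.
b12 = s12 / s22
aux_residuals = [a - b12 * b for a, b in zip(d1, d2)]
rss_1 = sum(e * e for e in aux_residuals)

print(rss_1, s11 * (1 - r12_sq))  # both approximately 1.6


def variance_factor(s_ii, r_sq):
    """var(beta_hat_i) / sigma^2 for the two-regressor model."""
    return 1.0 / (s_ii * (1.0 - r_sq))


for r_sq in (0.0, 0.5, 0.9, 0.99):
    print(r_sq, variance_factor(s11, r_sq))
```

With these data \(r_{12}^2 = 0.9\), so the variance of \(\hat{\beta}_1\) is ten times what it would be if education and experience were uncorrelated.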

Analogous to the other results in the case of simple regression, we have the
following results:

3. If RSS is the residual sum of squares, then \(\text{RSS}/\sigma^2\) has a \(\chi^2\)-distribution
with degrees of freedom \((n - 3)\). This result can be used to make
confidence interval statements about \(\sigma^2\).

4. If \(\hat{\sigma}^2 = \text{RSS}/(n - 3)\), then \(E(\hat{\sigma}^2) = \sigma^2\), that is, \(\hat{\sigma}^2\) is an unbiased estimator for
\(\sigma^2\).

5. If we substitute \(\hat{\sigma}^2\) for \(\sigma^2\) in the expressions in result 2, we get the
estimated variances and covariances. The square roots of the estimated
variances are called the standard errors (to be denoted by SE). Then

\[
\frac{\hat{\alpha} - \alpha}{\text{SE}(\hat{\alpha})} \qquad \frac{\hat{\beta}_1 - \beta_1}{\text{SE}(\hat{\beta}_1)} \qquad \frac{\hat{\beta}_2 - \beta_2}{\text{SE}(\hat{\beta}_2)}
\]

each has a \(t\)-distribution with degrees of freedom \((n - 3)\).

In addition to results 3 to 5, which have counterparts in the case of simple
regression, we have one extra item in the case of multiple regression, that of
confidence regions and joint tests for parameters. We have the following
result.

6. The statistic

\[
F = \frac{1}{2\hat{\sigma}^2}\left[S_{11}(\hat{\beta}_1 - \beta_1)^2 + 2S_{12}(\hat{\beta}_1 - \beta_1)(\hat{\beta}_2 - \beta_2) + S_{22}(\hat{\beta}_2 - \beta_2)^2\right]
\]

has an \(F\)-distribution with degrees of freedom 2 and \((n - 3)\). This result can be
used to construct a confidence region for \(\beta_1\) and \(\beta_2\) together and to test
\(\beta_1\) and \(\beta_2\) together.

We shall state later results 1 to 6 for the general case of \(k\) explanatory
variables. But first we will consider an illustrative example.


Illustrative Example 

A production function is specified as

\[
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \qquad u_i \sim \text{IN}(0, \sigma^2)
\]

where

\(y\) = log output
\(x_1\) = log labor input
\(x_2\) = log capital input

The variables \(x_i\) are nonstochastic. The following data are obtained from a
sample of size \(n = 23\) (23 individual firms):

\[
\bar{x}_1 = 10 \qquad \bar{x}_2 = 5 \qquad \bar{y} = 12
\]
\[
S_{11} = 12 \qquad S_{12} = 8 \qquad S_{22} = 12
\]
\[
S_{1y} = 10 \qquad S_{2y} = 8 \qquad S_{yy} = 10
\]

(a) Compute \(\hat{\alpha}\), \(\hat{\beta}_1\), and \(\hat{\beta}_2\) and their standard errors. Present the regression
equation.

(b) Find the 95% confidence intervals for \(\alpha\), \(\beta_1\), \(\beta_2\), and \(\sigma^2\) and test the
hypotheses \(\beta_1 = 1\) and \(\beta_2 = 0\) separately at the 5% significance level.

(c) Find a 95% confidence region for \(\beta_1\) and \(\beta_2\) and show it in a figure.

(d) Test the hypothesis \(\beta_1 = 1\), \(\beta_2 = 0\) at the 5% significance level.


Solution 

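(a) The normal equations are

\[
12\hat{\beta}_1 + 8\hat{\beta}_2 = 10
\]
\[
8\hat{\beta}_1 + 12\hat{\beta}_2 = 8
\]

These give \(\hat{\beta}_1 = 0.7\) and \(\hat{\beta}_2 = 0.2\). Hence \(\hat{\alpha} = \bar{y} - \hat{\beta}_1 \bar{x}_1 - \hat{\beta}_2 \bar{x}_2 = 12 - 0.7(10) - 0.2(5) = 4\).

\[
R^2_{y \cdot 12} = \frac{\hat{\beta}_1 S_{1y} + \hat{\beta}_2 S_{2y}}{S_{yy}} = \frac{0.7(10) + 0.2(8)}{10} = 0.86
\]

The residual sum of squares is \(\text{RSS} = S_{yy}(1 - R^2) = 10(1 - 0.86) = 1.4\). Hence

\[
\hat{\sigma}^2 = \frac{\text{RSS}}{n - 3} = \frac{1.4}{20} = 0.07
\]

We also have

\[
r_{12}^2 = \frac{S_{12}^2}{S_{11} S_{22}} = \frac{64}{144}
\]

Hence

\[
S_{11}(1 - r_{12}^2) = 12\left(1 - \frac{64}{144}\right) = 12\left(\frac{80}{144}\right) = \frac{20}{3}
\]

Hence

\[
\operatorname{var}(\hat{\beta}_1) = \operatorname{var}(\hat{\beta}_2) = \frac{\sigma^2}{S_{11}(1 - r_{12}^2)} = \frac{3}{20}\sigma^2
\]

and

\[
\operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) = \frac{-\sigma^2 r_{12}^2}{S_{12}(1 - r_{12}^2)} = \frac{-\sigma^2 (64/144)}{8(80/144)} = -\frac{\sigma^2}{10}
\]

Also, since \(\bar{x}_1 = 10\) and \(\bar{x}_2 = 5\), we have

\[
V(\hat{\alpha}) = \sigma^2\left[\frac{1}{23} + (10)^2\left(\frac{3}{20}\right) - 2(10)(5)\left(\frac{1}{10}\right) + (5)^2\left(\frac{3}{20}\right)\right] = 8.7935\sigma^2
\]

Substituting the estimate of \(\sigma^2\), which is 0.07, in these expressions and taking
the square roots, we get

\[
\text{SE}(\hat{\beta}_1) = \text{SE}(\hat{\beta}_2) = 0.102 \qquad \text{SE}(\hat{\alpha}) = 0.78
\]

Thus the regression equation is

\[
\hat{y} = \underset{(0.78)}{4.0} + \underset{(0.102)}{0.7}\,x_1 + \underset{(0.102)}{0.2}\,x_2 \qquad R^2 = 0.86
\]

Figures in parentheses are standard errors.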
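
The arithmetic of part (a) can be retraced from the summary statistics with the formulas of result 2. The sketch below is an illustration; the variable names are assumptions made here, and only quantities given in the example are used as inputs.

```python
import math

# Summary statistics given in the example.
n = 23
x1_bar, x2_bar, y_bar = 10.0, 5.0, 12.0
s11, s12, s22 = 12.0, 8.0, 12.0
s1y, s2y, syy = 10.0, 8.0, 10.0

# Coefficients from (4.9) and (4.5).
delta = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / delta
b2 = (s11 * s2y - s12 * s1y) / delta
a = y_bar - b1 * x1_bar - b2 * x2_bar

# Fit and error variance estimate.
r2 = (b1 * s1y + b2 * s2y) / syy
rss = syy * (1 - r2)
sigma2_hat = rss / (n - 3)

# Estimated variances and covariance (result 2 with sigma^2 replaced by its estimate).
r12_sq = s12 ** 2 / (s11 * s22)
var_b1 = sigma2_hat / (s11 * (1 - r12_sq))
var_b2 = sigma2_hat / (s22 * (1 - r12_sq))
cov_b1_b2 = -sigma2_hat * r12_sq / (s12 * (1 - r12_sq))
var_a = (sigma2_hat / n + x1_bar ** 2 * var_b1
         + 2 * x1_bar * x2_bar * cov_b1_b2 + x2_bar ** 2 * var_b2)

se_a, se_b1, se_b2 = math.sqrt(var_a), math.sqrt(var_b1), math.sqrt(var_b2)

print(round(a, 3), round(b1, 3), round(b2, 3))           # 4.0 0.7 0.2
print(round(r2, 2), round(sigma2_hat, 3))                # 0.86 0.07
print(round(se_a, 2), round(se_b1, 3), round(se_b2, 3))  # 0.78 0.102 0.102
```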

(b) Using the \(t\)-distribution with 20 d.f., we get the 95% confidence intervals
for \(\alpha\), \(\beta_1\), and \(\beta_2\) as

\[
\hat{\alpha} \pm 2.086\,\text{SE}(\hat{\alpha}) = 4.0 \pm 1.63 = (2.37, 5.63)
\]
\[
\hat{\beta}_1 \pm 2.086\,\text{SE}(\hat{\beta}_1) = 0.7 \pm 0.21 = (0.49, 0.91)
\]
\[
\hat{\beta}_2 \pm 2.086\,\text{SE}(\hat{\beta}_2) = 0.2 \pm 0.21 = (-0.01, 0.41)
\]

The hypothesis \(\beta_1 = 1.0\) will be rejected at the 5% significance level since \(\beta_1 = 1.0\)
is outside the 95% confidence interval for \(\beta_1\). The hypothesis \(\beta_2 = 0\) will
not be rejected because \(\beta_2 = 0\) is a point in the 95% confidence interval for \(\beta_2\).
Using the \(\chi^2\)-distribution for 20 d.f., we have

\[
\text{Prob}\left(9.59 < \frac{20\hat{\sigma}^2}{\sigma^2} < 34.2\right) = 0.95
\]

or

\[
\text{Prob}\left(\frac{20\hat{\sigma}^2}{34.2} < \sigma^2 < \frac{20\hat{\sigma}^2}{9.59}\right) = 0.95
\]

Thus the 95% confidence interval for \(\sigma^2\) is

\[
\left(\frac{20(0.07)}{34.2}, \frac{20(0.07)}{9.59}\right) \quad \text{or} \quad (0.041, 0.146)
\]
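
The interval arithmetic in part (b) is mechanical once the critical values are known. The sketch below is illustrative; it takes the tabulated points quoted in the text (2.086 for \(t\) with 20 d.f., 9.59 and 34.2 for \(\chi^2\) with 20 d.f.) as given constants rather than computing them, and the names are assumptions made here.

```python
# Tabulated critical values quoted in the text (20 degrees of freedom).
T_CRIT = 2.086
CHI2_LOW, CHI2_HIGH = 9.59, 34.2

# (estimate, standard error) pairs from part (a).
estimates = {"alpha": (4.0, 0.78), "beta1": (0.7, 0.102), "beta2": (0.2, 0.102)}


def t_interval(estimate, se):
    """95% confidence interval: estimate +/- t * SE."""
    half_width = T_CRIT * se
    return estimate - half_width, estimate + half_width


intervals = {name: t_interval(est, se) for name, (est, se) in estimates.items()}

# Interval for sigma^2 from RSS/sigma^2 ~ chi-square with 20 d.f.
sigma2_hat, df = 0.07, 20
sigma2_interval = (df * sigma2_hat / CHI2_HIGH, df * sigma2_hat / CHI2_LOW)


def contains(interval, value):
    return interval[0] <= value <= interval[1]


# Separate tests: reject a hypothesized value when it lies outside the interval.
reject_beta1_equals_1 = not contains(intervals["beta1"], 1.0)
reject_beta2_equals_0 = not contains(intervals["beta2"], 0.0)

for name, (low, high) in intervals.items():
    print(name, round(low, 2), round(high, 2))
print("sigma^2", round(sigma2_interval[0], 3), round(sigma2_interval[1], 3))
print(reject_beta1_equals_1, reject_beta2_equals_0)  # True False
```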


(c) The 5% point for the \(F\)-distribution with d.f. 2 and 20 is 3.49. Hence using
result 6, we can write the 95% confidence region for \(\beta_1\) and \(\beta_2\) as

\[
S_{11}(\hat{\beta}_1 - \beta_1)^2 + 2S_{12}(\hat{\beta}_1 - \beta_1)(\hat{\beta}_2 - \beta_2) + S_{22}(\hat{\beta}_2 - \beta_2)^2 \le 3.49(2\hat{\sigma}^2)
\]

or

\[
12(0.7 - \beta_1)^2 + 16(0.7 - \beta_1)(0.2 - \beta_2) + 12(0.2 - \beta_2)^2 \le 3.49(2)(0.07)
\]

Or, dividing by 12 throughout and changing \(\hat{\beta}_1 - \beta_1\) to \(\beta_1 - \hat{\beta}_1\) and \(\hat{\beta}_2 - \beta_2\)
to \(\beta_2 - \hat{\beta}_2\), we get

\[
(\beta_1 - 0.7)^2 + \tfrac{4}{3}(\beta_1 - 0.7)(\beta_2 - 0.2) + (\beta_2 - 0.2)^2 \le 0.041
\]

This is an ellipse with center at (0.7, 0.2). It is plotted in Figure 4.1. There are
two important things to note: first that the ellipse will be slanting to the left if
\(\operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) < 0\) as in our case, and it will be slanting to the right if
\(\operatorname{cov}(\hat{\beta}_1, \hat{\beta}_2) > 0\). In the case of the two explanatory variables this will depend on the
sign of \(S_{12}\). The second point to note is that the limits for \(\beta_1\) and \(\beta_2\) that we
obtain from the ellipse will be different from the 95% confidence limits we
obtained earlier for \(\beta_1\) and \(\beta_2\) separately. The limits are (-0.07, 0.47) for \(\beta_2\) and
(0.43, 0.97) for \(\beta_1\). This is because what we have here are joint confidence
limits. If \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are independent, the confidence coefficient for the joint
interval will simply be the product of the confidence coefficients for \(\beta_1\) and \(\beta_2\)
separately. Otherwise, it will be different.

Figure 4.1. Confidence ellipse for \(\beta_1\) and \(\beta_2\) in multiple regression.

This distinction between separate intervals and joint intervals, and between 
separate tests and joint tests should be kept in mind when we talk of statistical 
inference in the multiple regression model. We discuss this in detail in Section 
4.8. 

(d) To test the hypothesis \(\beta_1 = 1\), \(\beta_2 = 0\) at the 5% significance level we can
check whether the point is in the 95% confidence ellipse. But since we do not
plot the confidence ellipse all the time, we will apply the \(F\)-test discussed in
result 6 earlier. We have

\[
F = \frac{1}{2\hat{\sigma}^2}\left[S_{11}(\hat{\beta}_1 - \beta_1)^2 + 2S_{12}(\hat{\beta}_1 - \beta_1)(\hat{\beta}_2 - \beta_2) + S_{22}(\hat{\beta}_2 - \beta_2)^2\right]
\]

has an \(F\)-distribution with d.f. 2 and \(n - 3\).

The observed value \(F_0\) is

\[
F_0 = \frac{1}{2(0.07)}\left[12(0.7 - 1.0)^2 + 2(8)(0.7 - 1.0)(0.2 - 0) + 12(0.2 - 0)^2\right] = 4.3
\]

The 5% point from the \(F\)-tables with 2 and 20 d.f. is 3.49. Since \(F_0 > 3.49\) we
reject the hypothesis at the 5% significance level.
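
The joint test in part (d) and the ellipse of part (c) use the same quadratic form, so one small function covers both. The sketch below is illustrative; the function name is an assumption made here, and the 5% point 3.49 is taken from the text rather than computed.

```python
# Quantities from the production function example.
sigma2_hat = 0.07
s11, s12, s22 = 12.0, 8.0, 12.0
b1_hat, b2_hat = 0.7, 0.2
F_CRIT = 3.49  # 5% point of F with 2 and 20 d.f., as quoted in the text


def f_statistic(beta1, beta2):
    """F statistic of result 6 for the joint hypothesis (beta1, beta2)."""
    d1, d2 = b1_hat - beta1, b2_hat - beta2
    quadratic_form = s11 * d1 ** 2 + 2 * s12 * d1 * d2 + s22 * d2 ** 2
    return quadratic_form / (2 * sigma2_hat)


def in_confidence_region(beta1, beta2):
    """True when the point lies inside the 95% confidence ellipse."""
    return f_statistic(beta1, beta2) <= F_CRIT


f0 = f_statistic(1.0, 0.0)
print(round(f0, 2))                    # 4.29, which the text rounds to 4.3
print(in_confidence_region(1.0, 0.0))  # False: reject beta1 = 1, beta2 = 0
print(in_confidence_region(0.7, 0.2))  # True: the center of the ellipse
```

Testing a point against the ellipse and comparing \(F_0\) with the tabulated value are therefore the same operation.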

Formulas for the General Case of 
k Explanatory Variables 

We have given explicit expressions for the case of two explanatory variables 
so as to highlight the differences between simple and multiple regressions. The 
expressions for the general case can be written down neatly in matrix notation, 
and are given in the appendix to this chapter. 

If we have the multiple regression equation with \(k\) regressors, that is,

\[
y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + u_i
\]

then we have to make the following changes in the earlier results: In result 2
earlier, in the expressions for \(V(\hat{\beta}_1)\), \(V(\hat{\beta}_2)\), and so on, the denominator now
will be the residual sum of squares from a regression of that variable on all the
other \(x\)'s. Thus

\[
\operatorname{var}(\hat{\beta}_i) = \frac{\sigma^2}{\text{RSS}_i} \qquad \text{for } i = 1, 2, \ldots, k
\]

where \(\text{RSS}_i\) is the residual sum of squares from a regression of \(x_i\) on all the other
\((k - 1)\) \(x\)'s.

These \(k\) regressions are called auxiliary regressions. In practice we do not
estimate so many regressions. We have given the formula only to show the
relationship between simple and multiple regression. This relationship is
discussed in the next section.

In results 3 to 5 we have to change the degrees of freedom from \((n - 3)\) to
\((n - k - 1)\). In simple regression we have \(k = 1\) and hence d.f. = \(n - 2\). In
the case of two explanatory variables \(k = 2\) and hence we used d.f. = \(n - 3\).
In all cases one d.f. is for the estimation of \(\alpha\).

Result 6 is now

\[
F = \frac{1}{k\hat{\sigma}^2} \sum_{i=1}^{k} \sum_{j=1}^{k} S_{ij}(\hat{\beta}_i - \beta_i)(\hat{\beta}_j - \beta_j)
\]

has an \(F\)-distribution with d.f. \(k\) and \((n - k - 1)\). This result is derived in the
appendix to this chapter.

Some Illustrative Examples 

We will now present a few examples using the multiple regression methods. To 
facilitate the estimation of alternative models, we are presenting the data sets 
at the end of the chapter (Tables 4.7 to 4.14). 

Example 1 

In Table 4.7 data are given on sales prices of rural land near Sarasota, Florida.¹
The variables are listed at the end of the table. Since land values appreciated
steadily over time, the variable MO (month in which the parcel was sold) is
considered an important explanatory variable. Prices are also expected to be
higher for those lots that are wooded and those lots that are closer to
transportation facilities (airport, highways, etc.). Finally, the price increases less than
proportionately with acreage, and hence price per acre falls with increases in
acreage. The estimated equation was (figures in parentheses are standard
errors)

\[
\log P = \underset{(0.204)}{9.213} + \underset{(0.099)}{0.148}\,WL - \underset{(0.011)}{0.33}\,DA - \underset{(0.0142)}{0.0057}\,D75 - \underset{(0.0345)}{0.203}\,\log A + \underset{(0.0039)}{0.0141}\,MO \qquad R^2 = 0.559
\]

All the coefficients have the expected signs although some of them are not
significant (not significantly different from zero) at the 5% level.

Example 2 

In Example 1 we had some notions of what signs to expect for the different
coefficients in the regression equation. Sometimes, even if the signs are correct,
the magnitudes of the coefficients do not make economic sense. In that case
one has to examine the model or the data. An example where the magnitude of
the coefficient makes sense is the following.

1 These data have been provided by J. S. Shonkwiler. They are used in the paper of J. S. Shonkwiler
and J. E. Reynolds, “A Note on the Use of Hedonic Price Models in the Analysis of Land
Prices at the Urban Fringe,” Land Economics, February 1986.

In June 1978, California voters approved what is known as Proposition 13
limiting property taxes. This led to substantial and differential reduction in
property taxes, which led to an increase in housing prices. Rosen² studied the
impact of reduction in property taxes on housing prices in the San Francisco
Bay area. Besides property taxes, there are other factors that determine
housing prices and these have to be taken into account in the study. Rosen therefore
included the characteristics of the house, such as square footage, age of the
house, and a quality index. He also included economic factors such as median
income and transportation time to San Francisco. The estimated equation was
(figures in parentheses are \(t\)-values, not standard errors)

\[
\hat{y} = 0.171 + \underset{(2.97)}{7.275}\,x_1 + \underset{(2.32)}{0.547}\,x_2 + \underset{(1.34)}{0.00073}\,x_3 + \underset{(3.26)}{0.0638}\,x_4 - \underset{(2.24)}{0.0043}\,x_5 + \underset{(1.80)}{0.857}\,x_6 \qquad R^2 = 0.897 \quad n = 64
\]

where

\(y\) = change in post-Proposition 13 mean house prices
\(x_1\) = post-Proposition 13 decrease in property tax bill on mean house
\(x_2\) = mean square footage of house
\(x_3\) = median income of families in the area
\(x_4\) = mean age of house
\(x_5\) = transportation time to San Francisco
\(x_6\) = housing-quality index as computed by real estate appraisers

All the coefficients have the expected signs.

The coefficient of \(x_1\) indicates that each $1 decrease in property taxes
increases property values by about $7. The question is whether this is about the right
magnitude. Assuming that the property tax reduction is expected to be at the
same level in the future years, the present value of a $1 return per year is \(1/r\),
where \(r\) is the rate of interest (also expected to remain the same). This is equal
to $7 if \(r\) = 14.29%. The interest rates at that time were around this level and
thus Rosen concludes: “The capitalization rate implied by this equation is
about 7 which is precisely the magnitude that one would expect with an interest
rate of 12-15%.”
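
The capitalization argument in Example 2 rests on the present value of a perpetual annual saving. As a short worked version, under the assumption (made here for illustration) that the $1 saving is received at the end of each year and discounted at a constant rate \(r\):

\[
PV = \sum_{t=1}^{\infty} \frac{1}{(1 + r)^t} = \frac{1/(1 + r)}{1 - 1/(1 + r)} = \frac{1}{r}
\]

Setting \(PV = 7\) gives \(r = 1/7 \approx 0.1429\), the 14.29% quoted in the example. Interest rates of 12% and 15% correspond to \(1/r \approx 8.33\) and \(1/r \approx 6.67\), which bracket the estimated coefficient of 7.275.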


Example 3 

In Table 4.8 data are presented on gasoline consumption in the United States
for the years 1947-1969. Let us define

\[
G = \frac{KC}{M(\text{POP})} = \text{per capita consumption of gasoline in gallons}
\]

We will regress \(G\) on \(P_g\) and \(Y\). (The variables are defined in Table 4.8.) The
results we get are

\[
\hat{G} = \underset{(-1.14)}{-117.70} + \underset{(0.44)}{0.373}\,P_g + \underset{(12.02)}{0.156}\,Y \qquad R^2 = 0.953
\]

(Figures in parentheses are \(t\)-ratios.) The price \(P_g\) has the wrong sign although
it is not significant. Only per capita income appears to be the major determinant
of demand for gasoline. In log form the equation is

\[
\log \hat{G} = \underset{(-3.00)}{-8.72} + \underset{(1.23)}{0.535}\,\log P_g + \underset{(11.21)}{1.541}\,\log Y \qquad R^2 = 0.940
\]

2 Kenneth T. Rosen, “The Impact of Proposition 13 on Housing Prices in Northern California:
A Test of the Interjurisdictional Capitalization Hypothesis,” Journal of Political Economy,
February 1982, pp. 191-200.

Again \(P_g\) has the wrong sign. Instead of population, suppose that we use labor
force as divisor and define \(G\) as \(G = KC/ML\). This can be justified on the
argument that miles driven would be more related to \(L\) than to population. The
results we get are

\[
\hat{G} = \underset{(-0.70)}{-248.44} + \underset{(0.09)}{0.258}\,P_g + \underset{(9.15)}{0.406}\,Y \qquad R^2 = 0.925
\]

The variable \(P_g\) still has the wrong sign. In log form the results are

\[
\log \hat{G} = \underset{(-2.18)}{-8.53} + \underset{(0.92)}{0.541}\,\log P_g + \underset{(8.82)}{1.636}\,\log Y \qquad R^2 = 0.907
\]

Gilstein and Leamer³ estimate the equation using the 1947-1960 data and obtain

\[
\hat{G} = \underset{(77.5)}{799.1} - \underset{(0.706)}{2.563}\,P_g + \underset{(0.0015)}{0.0616}\,Y
\]

(Figures in parentheses are standard errors.) Also, we get, using the 1947-1960
data,

\[
\hat{G} = \underset{(-2.92)}{-439.4} - \underset{(-1.64)}{2.241}\,P_g + \underset{(22.25)}{0.662}\,Y \qquad R^2 = 0.979
\]

(Figures in parentheses are \(t\)-ratios.) In log form we get

\[
\log \hat{G} = \underset{(-5.67)}{-10.477} - \underset{(-1.16)}{0.387}\,\log P_g + \underset{(20.04)}{2.472}\,\log Y \qquad R^2 = 0.974
\]

In summary, the period 1947-1960 shows a price elasticity of demand slightly
negative and an income elasticity of demand significantly greater than unity.
Estimation of the same equation over 1947-1969 shows no responsiveness to
price but an income elasticity of demand again significantly greater than 1.

Further analysis using these data is left as an exercise. Students can
experiment with this data set after going through this and the next two chapters.

Example 4 

In Table 4.9 data are presented on the demand for food in the United States.
Let us try to estimate the price elasticity and income elasticity of demand for
food. A regression of \(Q_D\) on \(P_D\) and \(Y\) gave the following results:

\[
\hat{Q}_D = \underset{(15.76)}{92.05} - \underset{(-2.13)}{0.142}\,P_D + \underset{(7.56)}{0.236}\,Y \qquad R^2 = 0.7813
\]

(Figures in parentheses are \(t\)-ratios.) The coefficients have the right signs.
However, someone comes and says that the variables all have trends and we ought
to account for this by including \(T\) as an extra explanatory variable. The results
now are (all variables have the right signs)

\[
\hat{Q}_D = \underset{(18.47)}{105.1} - \underset{(-4.58)}{0.325}\,P_D + \underset{(9.85)}{0.315}\,Y - \underset{(3.67)}{0.224}\,T \qquad R^2 = 0.8812
\]

Now the price variable is more significant than before. Also, a simple
regression of \(Q_D\) on \(P_D\) gives a wrong sign for \(P_D\):

\[
\hat{Q}_D = \underset{(7.89)}{89.97} + \underset{(0.91)}{0.107}\,P_D \qquad R^2 = 0.044
\]

3 C. Z. Gilstein and E. E. Leamer, “Robust Sets of Regression Estimates,” Econometrica, Vol.
51, No. 2, March 1983, p. 330.

We have given four examples, the first two on cross-section data, the other 
two on time-series data. The results were more straightforward with the cross- 
section data sets. With the time-series data sets, we have had some trouble 
obtaining meaningful results. This is a common experience. The problem is that 
most of the series move together with time. 


4.4 Interpretation of the 
Regression Coefficients 


In simple regression we are concerned with measuring the effect of the
explanatory variable on the explained variable. Since the regression equation can be
written as

\[
y - \bar{y} = \beta(x - \bar{x}) + u
\]

this effect is measured by \(\hat{\beta}\), where

\[
\hat{\beta} = \frac{S_{xy}}{S_{xx}} \quad \text{or} \quad \frac{\operatorname{cov}(x, y)}{V(x)}
\]

In the multiple regression equation with two explanatory variables \(x_1\) and \(x_2\) we
can talk of the joint effect of \(x_1\) and \(x_2\) and the partial effect of \(x_1\) or \(x_2\) upon \(y\).
Since the regression equation can be written as

\[
y - \bar{y} = \beta_1(x_1 - \bar{x}_1) + \beta_2(x_2 - \bar{x}_2) + u
\]

the partial effect of \(x_1\) is measured by \(\beta_1\) and the partial effect of \(x_2\) is measured
by \(\beta_2\). By partial effect we mean holding the other variable constant or after
eliminating the effect of the other variable. Thus \(\beta_1\) is to be interpreted as
measuring the effect of \(x_1\) on \(y\) after eliminating the effect of \(x_2\) on \(x_1\). Similarly, \(\beta_2\)
is to be interpreted as measuring the effect of \(x_2\) on \(y\) after eliminating the effect
of \(x_1\) on \(x_2\).

This interpretation suggests that we can derive the estimator \(\hat{\beta}_1\) of \(\beta_1\) by
estimating two separate simple regressions:

Step 1. Estimate a regression of \(x_1\) on \(x_2\). Let the regression coefficient be
denoted by \(b_{12}\). Denoting the residuals from this equation by \(W_i\), we have

\[
W_i = x_{1i} - \bar{x}_1 - b_{12}(x_{2i} - \bar{x}_2) \tag{4.11}
\]

Note that \(W_i\) is that part of \(x_1\) that is left after removing the effect of \(x_2\) on \(x_1\).

Step 2. Now regress \(y_i\) on \(W_i\). The regression coefficient is nothing but \(\hat{\beta}_1\),
which we derived earlier in the multiple regression [see (4.9)]. To see this,
note that the regression coefficient of \(y_i\) on \(W_i\) is nothing but \(\operatorname{cov}(y_i, W_i)/\operatorname{var}(W_i)\).
From (4.11) we have (since \(\sum W_i = 0\))

\[
nV(W_i) = \sum W_i^2 = S_{11} + b_{12}^2 S_{22} - 2b_{12} S_{12}
\]

but \(b_{12} = S_{12}/S_{22}\), hence \(nV(W_i) = S_{11} - S_{12}^2/S_{22}\). Also,

\[
n \operatorname{cov}(y_i, W_i) = \sum y_i W_i = S_{1y} - b_{12} S_{2y}
\]

Now substitute \(b_{12} = S_{12}/S_{22}\). We get, on simplification,

\[
n \operatorname{cov}(y_i, W_i) = S_{1y} - \frac{S_{12} S_{2y}}{S_{22}}
\]

Hence we get

\[
\frac{n \operatorname{cov}(y_i, W_i)}{nV(W_i)} = \frac{S_{22} S_{1y} - S_{12} S_{2y}}{S_{11} S_{22} - S_{12}^2}
\]

which is the expression for \(\hat{\beta}_1\) we got in (4.9).

Suppose that we eliminate the effect of \(x_2\) on \(y\) as well. Let \(V_i\) be the residual
from a regression of \(y\) on \(x_2\). If we now regress \(V_i\) on \(W_i\), then the regression
coefficient we get will be the same as that obtained from a regression of \(y_i\) on
\(W_i\). This is because

\[
V_i = y_i - \bar{y} - b_{y2}(x_{2i} - \bar{x}_2)
\]

where \(b_{y2}\) is the regression coefficient of \(y\) on \(x_2\). However, \(x_2\) will be
uncorrelated with \(W\) (a residual is uncorrelated with a regressor). Hence

\[
\operatorname{cov}(V, W) = \operatorname{cov}(y, W)
\]

Thus a regression of \(V_i\) on \(W_i\) will produce the same estimate \(\hat{\beta}_1\) as a regression
of \(y_i\) on \(W_i\). Of course, the standard errors of \(\hat{\beta}_1\) from the two regressions will
be different because

\[
\operatorname{var}(V) < \operatorname{var}(y)
\]
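
The two-step argument can be checked on a small data set. The sketch below is illustrative only (the helper names `deviations` and `slope` are assumptions made here). It uses the salary data of Table 4.1 from Section 4.2, where the multiple regression gave \(\hat{\beta}_1 = -0.25\), and shows that regressing either \(y\) or the residual \(V\) on the residual \(W\) returns the same coefficient.

```python
# Table 4.1 data: y = salary, x1 = education, x2 = experience.
y = [30, 20, 36, 24, 40]
x1 = [4, 3, 6, 4, 8]
x2 = [10, 8, 11, 9, 12]


def deviations(values):
    """Deviations of a series from its mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope(dep, reg):
    """Simple regression slope for series already in deviation form."""
    return sum(d * r for d, r in zip(dep, reg)) / sum(r * r for r in reg)


yd, x1d, x2d = deviations(y), deviations(x1), deviations(x2)

# Step 1: regress x1 on x2 and keep the residuals W, as in (4.11).
b12 = slope(x1d, x2d)
w = [a - b12 * b for a, b in zip(x1d, x2d)]

# Step 2: regress y on W.
beta1_from_y = slope(yd, w)

# Variant: first remove the effect of x2 from y as well, then regress V on W.
by2 = slope(yd, x2d)
v = [a - by2 * b for a, b in zip(yd, x2d)]
beta1_from_v = slope(v, w)

print(round(beta1_from_y, 6), round(beta1_from_v, 6))  # -0.25 -0.25
```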

This result is important and useful in trend elimination and seasonal
adjustment of time-series data. What it implies is that if we have an explained variable
\(y\) and an explanatory variable \(x\) and there is another nuisance variable \(Z\) that
is influencing both \(y\) and \(x\), the “pure” effect of \(x\) on \(y\) after eliminating the
effect of this nuisance variable \(Z\) on both \(x\) and \(y\) can be estimated simply by
estimating the multiple regression equation

\[
y = \alpha + \beta x + \gamma Z + u
\]

The coefficient \(\beta\) gives us the “pure” effect needed. We do not have to run a
regression of \(y\) on \(Z\) and \(x\) on \(Z\) to eliminate the effect of \(Z\) on these variables
and then a third regression of “purified” \(y\) on “purified” \(x\).

Although we have proved the result for only two variables, the result is
general. \(x\) and \(Z\) can both be sets of variables (instead of single variables).⁴

Finally, the standard error of \(\hat{\beta}_1\) obtained from the estimation of the multiple
regression (4.1) is also the same as the standard error of the regression
coefficient obtained from the simple regression of \(y_i\) on \(W_i\). This result is somewhat
tedious to prove here, so we will omit the proof.⁵

The correlation coefficient between \(y\) and \(W\) is called the partial correlation
between \(y\) and \(x_1\). It is partial correlation in the sense that the effect of \(x_2\) has
been removed. It will be denoted by \(r_{y1 \cdot 2}\). The subscript after the dot denotes
the variable whose effect has been removed. Corresponding to the relationship
we have given in Chapter 3 (end of Section 3.6), we have the formula

\[
r^2_{y1 \cdot 2} = \frac{t_1^2}{t_1^2 + (n - 3)}
\]

where \(t_1 = \hat{\beta}_1/\text{SE}(\hat{\beta}_1)\). The derivation of this formula is also omitted here since
it is somewhat tedious.

The interpretation of \(\hat{\beta}_2\) and the formulas associated with it are similar. Thus
we have

\[
r^2_{y2 \cdot 1} = \frac{t_2^2}{t_2^2 + (n - 3)}
\]

where \(t_2 = \hat{\beta}_2/\text{SE}(\hat{\beta}_2)\).

Illustrative Example 


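In the illustrative example in Section 4.3 we have

\[
t_1 = \frac{0.7}{0.102} = 6.863 \qquad t_2 = \frac{0.2}{0.102} = 1.961
\]

Hence

\[
r^2_{y1 \cdot 2} = \frac{(6.863)^2}{(6.863)^2 + 20} = \frac{47.10}{67.10} = 0.70
\]

\[
r^2_{y2 \cdot 1} = \frac{(1.961)^2}{(1.961)^2 + 20} = \frac{3.846}{23.846} = 0.16
\]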
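
The conversion from a \(t\)-ratio to a squared partial correlation is a one-line function. The sketch below is illustrative (the name `partial_r2` is an assumption made here) and reproduces the two values from the coefficients and standard errors of the Section 4.3 example.

```python
def partial_r2(t_ratio, degrees_of_freedom):
    """Squared partial correlation implied by a t-ratio: t^2 / (t^2 + d.f.)."""
    return t_ratio ** 2 / (t_ratio ** 2 + degrees_of_freedom)


n = 23          # sample size in the Section 4.3 example
df = n - 3      # two slopes and one intercept are estimated

t1 = 0.7 / 0.102
t2 = 0.2 / 0.102

r2_y1_given_2 = partial_r2(t1, df)
r2_y2_given_1 = partial_r2(t2, df)

print(round(t1, 3), round(t2, 3))                        # 6.863 1.961
print(round(r2_y1_given_2, 2), round(r2_y2_given_1, 2))  # 0.7 0.16
```

For a model with \(k\) regressors the same function applies with `degrees_of_freedom` set to \(n - k - 1\).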




4 For a general model this is proved in the appendix to this chapter under the title “Prior
Adjustment.”

5 The proof can easily be constructed using the result in G. S. Maddala, Econometrics (New
York: McGraw-Hill, 1977), p. 462.

The General Case

In the general case of \(k\) regressors we define \(t_i = \hat{\beta}_i/\text{SE}(\hat{\beta}_i)\) for \(i = 1, 2, \ldots, k\).
Then the partial \(r^2\) for the variable \(x_i\) is

\[
r^2_{y x_i \cdot \text{other } x\text{'s}} = \frac{t_i^2}{t_i^2 + (n - k - 1)}
\]


4.5 Partial Correlations and 
Multiple Correlation 


If we have an explained variable \(y\) and three explanatory variables \(x_1\), \(x_2\), \(x_3\)
and \(r^2_{y1}\), \(r^2_{y2}\), \(r^2_{y3}\) are the squares of the simple correlations between \(y\) and \(x_1\), \(x_2\),
\(x_3\), respectively, then \(r^2_{y1}\), \(r^2_{y2}\), and \(r^2_{y3}\) measure the proportion of the variance in
\(y\) that \(x_1\) alone, \(x_2\) alone, or \(x_3\) alone explain. On the other hand, \(R^2_{y \cdot 123}\) measures
the proportion of the variance of \(y\) that \(x_1\), \(x_2\), \(x_3\) together explain. We would
also like to measure something else. For instance, how much does \(x_2\) explain
after \(x_1\) is included in the regression equation? How much does \(x_3\) explain after
\(x_1\) and \(x_2\) are included? These are measured by the partial coefficients of
determination \(r^2_{y2 \cdot 1}\) and \(r^2_{y3 \cdot 12}\), respectively.⁶ The variables after the dot are the
variables already included. With three explanatory variables we have the following
partial correlations: \(r_{y1 \cdot 2}\), \(r_{y1 \cdot 3}\), \(r_{y2 \cdot 1}\), \(r_{y2 \cdot 3}\), \(r_{y3 \cdot 1}\), and \(r_{y3 \cdot 2}\). These are called partial
correlations of the first order. We also have three partial correlation coefficients
of the second order: \(r_{y1 \cdot 23}\), \(r_{y2 \cdot 13}\), and \(r_{y3 \cdot 12}\). The variables after the dot are always
the variables already included in the regression equation. The order of the
partial correlation coefficient depends on the number of variables after the dot.
The usual convention is to denote simple and partial correlations by a small \(r\)
and multiple correlations by capital \(R\). For instance, \(R^2_{y \cdot 12}\), \(R^2_{y \cdot 13}\), and \(R^2_{y \cdot 123}\) are
all coefficients of multiple determination (their positive square roots are
multiple correlation coefficients).

How do we compute the partial correlation coefficients? For this we use the
relationship between $r^2$ and $t^2$. For example, to compute $r^2_{y2\cdot 3}$ we have to consider
the multiple regression of $y$ on $x_2$ and $x_3$. Let the estimated regression
equation be

$$\hat y = \hat\alpha + \hat\beta_2 x_2 + \hat\beta_3 x_3$$

Let $t_2 = \hat\beta_2/\mathrm{SE}(\hat\beta_2)$ from this equation. Then

$$r^2_{y2\cdot 3} = \frac{t_2^2}{t_2^2 + \text{d.f.}}$$

The degrees of freedom d.f. = (number of observations) − (number of regression
parameters estimated) = $(n - 3)$ in this case. The variables to be included


6 We are using the term coefficient of determination to denote the square of the coefficient of
correlation.

in the multiple regression equation are all variables that are mentioned in the
partial correlation coefficient. For instance, if we want to compute $r^2_{y4\cdot 237}$, we
have to run a regression of $y$ on $x_4$, $x_2$, $x_3$, and $x_7$. Let $t_4 = \hat\beta_4/\mathrm{SE}(\hat\beta_4)$, where $\hat\beta_4$
is the coefficient of $x_4$. Then

$$r^2_{y4\cdot 237} = \frac{t_4^2}{t_4^2 + \text{d.f.}}$$

and d.f. = $(n - 5)$ in this case (one $\alpha$ and four $\beta$'s).
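As a check on this recipe, the sketch below computes $r^2_{y2\cdot 3}$ in two ways for a small data set: from the $t$-ratio of $\hat\beta_2$ in the regression of $y$ on $x_2$ and $x_3$, and as the squared correlation between the parts of $y$ and $x_2$ left unexplained by $x_3$. The data and the helper name `cross` are hypothetical and are not taken from the text; the two-regressor formulas used are the standard ones in terms of mean-corrected sums of squares and cross products.

```python
def cross(u, v):
    """Mean-corrected sum of cross products S_uv."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

# Hypothetical data, chosen only for illustration.
y = [3.0, 4.5, 6.1, 6.9, 9.2, 9.8, 12.1, 12.9]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
n = len(y)

S22, S33, S23 = cross(x2, x2), cross(x3, x3), cross(x2, x3)
S2y, S3y, Syy = cross(x2, y), cross(x3, y), cross(y, y)
delta = S22 * S33 - S23 ** 2

# Route 1: t-ratio of the x2 coefficient in the regression of y on x2 and x3.
b2 = (S33 * S2y - S23 * S3y) / delta
b3 = (S22 * S3y - S23 * S2y) / delta
rss = Syy - b2 * S2y - b3 * S3y
df = n - 3                                  # one constant and two slopes
se_b2 = (rss / df * S33 / delta) ** 0.5
t2 = b2 / se_b2
partial_from_t = t2 ** 2 / (t2 ** 2 + df)

# Route 2: squared correlation of the residuals of y and of x2 after removing x3.
Syy_3 = Syy - S3y ** 2 / S33                # residual SS of y on x3
S22_3 = S22 - S23 ** 2 / S33                # residual SS of x2 on x3
S2y_3 = S2y - S23 * S3y / S33               # cross product of the two residual series
partial_from_resid = S2y_3 ** 2 / (S22_3 * Syy_3)

print(partial_from_t, partial_from_resid)
```

The two routes agree up to floating-point rounding, which is why the $t$-ratio formula can be used in place of running the auxiliary regressions.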

Partial correlations are very important in deciding whether or not to include
more explanatory variables. For instance, suppose that we have two explanatory
variables $x_1$ and $x_2$, and $r^2_{y2}$ is very high, say 0.95, but $r^2_{y2\cdot 1}$ is very low, say
0.01. What this means is that if $x_2$ alone is used to explain $y$, it can do a good
job. But after $x_1$ is included, $x_2$ does not help any more in explaining $y$; that is, $x_1$
has done the job of $x_2$. In this case there is no use including $x_2$. In fact, we can
have a situation where, for instance,

$$r^2_{y1} = 0.95 \quad\text{and}\quad r^2_{y2} = 0.96$$

$$r^2_{y1\cdot 2} = 0.1 \quad\text{and}\quad r^2_{y2\cdot 1} = 0.1$$

In this case each variable is highly correlated with $y$ but the partial correlations
are both very low. This is called multicollinearity and we will discuss this
problem later in Chapter 7. In this example we can use $x_1$ only or $x_2$ only or
some combination of the two as an explanatory variable. For instance, suppose
that $x_1$ is the amount of skilled labor, $x_2$ the amount of unskilled labor, and $y$
the output. What the partial correlation coefficients suggest is that the separation
of total labor into two components (skilled and unskilled) does not help
us much in explaining output. So we might as well use $x_1 + x_2$, or total labor, as
the explanatory variable.

4.6 Relationships Among Simple, Partial, 
and Multiple Correlation Coefficients 

To study the relationships among the different types of correlation coefficients,
we need to make use of the relationship that at each stage the residual sum of
squares RSS = $S_{yy}(1 - R^2)$. Suppose that we have two explanatory variables
$x_1$ and $x_2$. Then

$S_{yy}(1 - R^2_{y\cdot 12})$ = residual sum of squares after both $x_1$ and $x_2$ are included
$S_{yy}(1 - r^2_{y1})$ = residual sum of squares after the inclusion of $x_1$ only

Now $r^2_{y2\cdot 1}$ measures the proportion of this residual sum of squares explained by
$x_2$. Hence the unexplained residual sum of squares after $x_2$ is also included is

$$(1 - r^2_{y2\cdot 1})\,S_{yy}(1 - r^2_{y1})$$

But this is equal to $S_{yy}(1 - R^2_{y\cdot 12})$. Hence we get the result that

$$1 - R^2_{y\cdot 12} = (1 - r^2_{y1})(1 - r^2_{y2\cdot 1})$$

If we have three explanatory variables we get

$$(1 - R^2_{y\cdot 123}) = (1 - r^2_{y1})(1 - r^2_{y2\cdot 1})(1 - r^2_{y3\cdot 12}) \tag{4.12}$$

The subscripts 1, 2, 3 can be interchanged in any required order. Thus if we
need to consider them in the order 3, 1, 2, we have

$$(1 - R^2_{y\cdot 123}) = (1 - r^2_{y3})(1 - r^2_{y1\cdot 3})(1 - r^2_{y2\cdot 31})$$

Note that the partial $r^2$ can be greater or smaller than the simple $r^2$. For
instance, the variable $x_2$ might explain only 20% of the variance of $y$, but after
$x_1$ is included, it can explain 50% of the residual variance. In this case the
simple $r^2$ is 0.2 and the partial $r^2$ is 0.5. We are thus talking of the proportions
explained of different variances.

On the other hand, the simple $r^2$ and the partial $r^2$ can never be greater than
$R^2$ (the square of the multiple correlation coefficient). This is clear from the
formulas we have stated.
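The two-variable identity can be verified numerically. The sketch below uses the mean-corrected sums of squares and cross products of the chapter's running illustrative example as they are quoted in Section 4.8 ($S_{11} = S_{22} = 12$, $S_{12} = 8$, $S_{1y} = 10$, $S_{2y} = 8$, $S_{yy} = 10$, $n = 23$); the variable names are illustrative choices, not the book's.

```python
# Sums of squares and cross products of the running illustrative example.
S11, S22, S12 = 12.0, 12.0, 8.0
S1y, S2y, Syy = 10.0, 8.0, 10.0
n = 23

delta = S11 * S22 - S12 ** 2
b1 = (S22 * S1y - S12 * S2y) / delta        # 0.7
b2 = (S11 * S2y - S12 * S1y) / delta        # 0.2
rss = Syy - b1 * S1y - b2 * S2y             # 1.4
R2 = 1 - rss / Syy                          # multiple R^2

r2_y1 = S1y ** 2 / (S11 * Syy)              # simple r^2 between y and x1

# Partial r^2 of x2 given x1, from the t-ratio of b2.
sigma2_hat = rss / (n - 3)
t2_sq = b2 ** 2 / (sigma2_hat * S11 / delta)
r2_y2_1 = t2_sq / (t2_sq + (n - 3))

lhs = 1 - R2
rhs = (1 - r2_y1) * (1 - r2_y2_1)
print(lhs, rhs)
```

Both sides equal 0.14 here, and $R^2 = 0.86$ is at least as large as either the simple or the partial $r^2$, as the text states.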


Two Illustrative Examples 

The following examples illustrate simple $r^2$, partial $r^2$, and $R^2$. The first example
illustrates some problems in interpreting multiple regression coefficients when
the explanatory variables are proportions. The second example illustrates the
use of interaction terms.


Example I: Hospital Costs 

Consider the analysis of hospital costs by case mix by Feldstein.⁷ The data are
for 177 hospitals. The explanatory variables are proportions of total patients
treated in each category. There are nine categories: M = medical, P = pediatrics,
S = general surgery, E = ENT, T = traumatic and orthopedic surgery,
OS = other surgery, G = gynecology, Ob = obstetrics, Other = miscellaneous
others. The regression coefficients, their standard errors, t-values, partial
$r^2$'s, simple $r^2$'s, and average cost per case are given in Table 4.2.

The entries in Table 4.2 need some explanation. The t-values are just the
coefficients divided by the respective standard errors. The partial $r^2$'s are
obtained by the formula

$$\text{partial } r^2 = \frac{t^2}{t^2 + \text{d.f.}}$$

7 M. S. Feldstein, Economic Analysis for the Health Service Industry (Amsterdam: North-Holland,
1967), Chap. 1.

Thus, for the variable M,

$$\text{partial } r^2 = \frac{(2.38)^2}{(2.38)^2 + 168} = 0.0326$$

The simple $r^2$ is just the square of the correlation of average cost per case and
the proportion of the cases in that category. The two are presented together to
illustrate how the partial $r^2$'s can be lower or higher than the simple $r^2$'s.

The regression coefficients in this example have to be interpreted in a different
way. In the usual case, each regression coefficient measures the change in
the dependent variable $y$ for a unit increase in that independent variable, holding
other variables constant. In this example, since the independent variables are
proportions in each category, an increase in one variable holding other variables
constant does not make sense (in fact, it is impossible to do it). The proper
interpretation of the coefficients in this case is this: Put the value of M = 1, all
others 0. Then the estimated value of the dependent variable = (constant +
coefficient of M) = 69.51 + 44.97 = 114.48. This is the average cost of treating
a case in M. Similarly, constant + coefficient of P = 69.51 − 44.54 = 24.97
is the average cost of treating a case in P. Finally, putting M, P, ..., Ob all equal to
0, we get the constant term = 69.51 as the average cost of treating a case in
the “Other” category. These estimates are all presented in the last column of
Table 4.2. What the regression equation enables us to estimate in this case is
the average cost of treating a case in each category. The standard errors of
these estimates can be calculated as $\mathrm{SE}(\hat\alpha + \hat\beta_i)$ if we know the covariance
between the constant term $\hat\alpha$ and the other regression coefficient $\hat\beta_i$.

An important point to note is that we have not estimated a coefficient for the
category “Other.” An alternative procedure in which we estimate coefficients
for all the categories directly would be to include all the variables and delete
the constant term. We have to delete the constant term because the sum of all
the explanatory variables (which are proportions of cases treated in each category)
is equal to 1 for all observations. Since the constant term corresponds
to an explanatory variable that is identically equal to 1 for all observations, we
cannot include it in the regression equation. If we do, we have a situation where
one of the explanatory variables is identically the sum of the other explanatory
variables. In this case we cannot estimate all the coefficients.
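The difficulty can be seen numerically. In the sketch below, three hypothetical proportions sum to 1 for every observation (the numbers are invented for illustration and are not Feldstein's data). If all three are used together with a constant term, the matrix of mean-corrected sums of squares and cross products that appears in the normal equations is singular, so the equations have no unique solution. Dropping one category restores a nonzero determinant.

```python
def cross(u, v):
    """Mean-corrected sum of cross products S_uv."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

# Hypothetical case-mix proportions for five hospitals; each row sums to 1.
p1 = [0.2, 0.5, 0.3, 0.6, 0.1]
p2 = [0.3, 0.2, 0.4, 0.1, 0.6]
p3 = [1.0 - a - b for a, b in zip(p1, p2)]

# All three proportions plus a constant: singular normal equations.
cols = [p1, p2, p3]
S = [[cross(u, v) for v in cols] for u in cols]
det_all = det3(S)

# Two proportions plus a constant (third category omitted): well defined.
det_two = cross(p1, p1) * cross(p2, p2) - cross(p1, p2) ** 2

print(det_all, det_two)
```

The first determinant is zero up to rounding error, the second is not, which is the numerical counterpart of having to drop either one category or the constant term.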

To see what the problem is, consider the equation

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$

where $x_3 = x_1 + x_2$. We can easily see that we cannot get separate estimates
of $\beta_1$, $\beta_2$, and $\beta_3$ since the equation reduces to

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3(x_1 + x_2) + u = \alpha + (\beta_1 + \beta_3)x_1 + (\beta_2 + \beta_3)x_2 + u$$

Such a situation is known as “perfect multicollinearity” and is discussed in
greater detail in Chapter 7.

A regression equation with no constant term is also called “regression
through the origin” because if the $x$-variables are all zero, the $y$-variable is also
zero. The estimation of such an equation proceeds as before and the normal
equations are defined as in equations (4.7) and (4.8) except that we do not apply
the “mean corrections.” That is, we define

$S_{11} = \sum x_{1i}^2$ and not $\sum (x_{1i} - \bar x_1)^2$

$S_{12} = \sum x_{1i}x_{2i}$ and not $\sum (x_{1i} - \bar x_1)(x_{2i} - \bar x_2)$, etc.

After these changes are made, the solution of the normal equations proceeds
as before. The definition of RSS and $R^2$ also proceeds as before, with the
change that in $S_{yy}$ as well, no mean correction is applied.

Cautionary Note: Many regression programs allow the option of including or
excluding the constant term. However, some of them may not give the correct
$R^2$ when running a regression equation with no constant term. If the $R^2$ is
computed as $1 - \text{RSS}/S_{yy}$ and $S_{yy}$ is computed using the mean correction, we might
get a very small $R^2$ and sometimes even a negative $R^2$.

Note that with two explanatory variables and no constant term, RSS =
$\sum y_i^2 - \hat\beta_1(\sum y_i x_{1i}) - \hat\beta_2(\sum y_i x_{2i})$, and if $S_{yy}$ is computed (wrongly) with a
mean correction, so that $S_{yy} = \sum y_i^2 - n\bar y^2$, it can happen that $S_{yy} <$ RSS and
thus we get a negative $R^2$.

The correct $R^2$ in this case is $1 - \text{RSS}/\sum y_i^2$. Of course, if $\bar y$ is very high,
$\sum y_i^2$ will be very high compared with RSS and we might get a high $R^2$. But one
should not infer from this that the equation with no constant term is better than
the one with a constant term. The two $R^2$ values are not comparable. With a
constant term, the $R^2$ measures the proportion of $\sum (y_i - \bar y)^2$ explained by the
explanatory variables. Without a constant term, the $R^2$ measures the proportion
of $\sum y_i^2$ explained by the explanatory variables. Thus we are talking of proportions
of two different things, and hence we cannot compare them.
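The point of the cautionary note can be seen in a few lines. The sketch below fits a regression through the origin with a single explanatory variable (one regressor is enough to show the effect) and computes $R^2$ both ways. The data are hypothetical and were chosen so that $y$ has a large mean and little variation around it.

```python
# Hypothetical data: y is nearly constant around 10 and unrelated to x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 9.0, 11.0, 10.0, 10.0]
n = len(y)

sum_xy = sum(a * b for a, b in zip(x, y))
sum_xx = sum(a * a for a in x)
sum_yy = sum(b * b for b in y)           # no mean correction

# Regression through the origin: beta = sum(xy) / sum(x^2).
beta = sum_xy / sum_xx
rss = sum_yy - beta * sum_xy

# Wrong denominator for this model: mean-corrected S_yy.
ybar = sum(y) / n
syy_corrected = sum_yy - n * ybar ** 2

r2_wrong = 1 - rss / syy_corrected       # can be negative
r2_right = 1 - rss / sum_yy              # proportion of sum(y^2) explained

print(round(r2_wrong, 2), round(r2_right, 3))
```

Here the mean-corrected denominator gives a large negative value while the uncorrected one gives a value between 0 and 1. Neither number is comparable with the $R^2$ of a regression that includes a constant term.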


Example 2: Demand for Food 

Table 4.3 presents data on per capita food consumption, price of food, and per
capita income for the years 1927-1941 and 1948-1962.⁸ Two equations are
estimated for these data:

Equation 1: $\log q = \alpha + \beta_1 \log p + \beta_2 \log y$

Equation 2: $\log q = \alpha + \beta_1 \log p + \beta_2 \log y + \beta_3 \log p \log y$

The last term in Equation 2 is an interaction term that allows the price and
income elasticities to vary. Equation 1 implies constant price and income
elasticities.


8 The data are from Frederick V. Waugh, Demand and Price Analysis: Some Examples from
Agriculture, U.S.D.A. Technical Bulletin 1316, November 1964, p. 16. The value of y for 1955
has been corrected to 96.5 from 86.5 in that table.

Table 4.3 Indexes of Food Consumption, Food Price, and Consumer Income (a)

        Food Consumption    Food Price,    Consumer Income,
Year    per Capita, q       p (b)          y (c)

1927         88.9              91.7             57.7
1928         88.9              92.0             59.3
1929         89.1              93.1             62.0
1930         88.7              90.9             56.3
1931         88.0              82.3             52.7
1932         85.9              76.3             44.4
1933         86.0              78.3             43.8
1934         87.1              84.3             47.8
1935         85.4              88.1             52.1
1936         88.5              88.0             58.0
1937         88.4              88.4             59.8
1938         88.6              83.5             55.9
1939         91.7              82.4             60.3
1940         93.3              83.0             64.1
1941         95.1              86.2             73.7

(World War II years excluded)

1948         96.7             105.3             82.1
1949         96.7             102.0             83.1
1950         98.0             102.4             88.6
1951         96.1             105.4             88.3
1952         98.1             105.0             89.1
1953         99.1             102.6             92.1
1954         99.1             101.9             91.7
1955         99.8             100.8             96.5
1956        101.5             100.0             99.8
1957         99.9              99.8             99.9
1958         99.1             101.2             98.4
1959        101.0              98.8            101.8
1960        100.7              98.4            101.8
1961        100.8              98.8            103.1
1962        101.0              98.4            105.5

(a) 1957-1959 = 100.
(b) Retail prices of Bureau of Labor Statistics, deflated by dividing by Consumer Price Index.
(c) Per capita disposable income, deflated by dividing by Consumer Price Index.


The estimates of $\alpha$, $\beta_1$, $\beta_2$, and $\beta_3$ for 1927-1941 and 1948-1962 separately
and for both periods combined are presented in Table 4.4. Note that we have
presented the t-ratios in parentheses because this is more convenient to interpret.
If we are interested in significance tests, the t-ratios are more convenient
and if we are interested in obtaining confidence intervals, it is convenient to
have the standard errors.

RSS/d.f. The importance of this is discussed in Section 4.9.

The interaction term has a very low t-value in the equations for 1927-1941
and 1948-1962 separately. Hence we will consider only the equation for the
combined data. For this equation the income coefficient has a low t-value and
the sign also appears to be wrong at first sight. However, the equation does
give positive income elasticities. The income elasticities are given by

$$\eta_y = \frac{\partial \log q}{\partial \log y} = \hat\beta_2 + \hat\beta_3 \log p = -0.718 + 0.211 \log p$$

$\log p$ ranges from 4.33 to 4.66 in the data. Thus the income elasticity ranges
from 0.195 to 0.265.

The price elasticity is given by

$$\eta_p = \frac{\partial \log q}{\partial \log p} = \hat\beta_1 + \hat\beta_3 \log y = -0.996 + 0.211 \log y$$

Thus, as income increases, demand for food becomes more price inelastic. For
the data in Table 4.3, $\log y$ ranges from 3.78 to 4.66. Thus the price elasticity
ranges from −0.198 to −0.013.
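The elasticity ranges quoted above can be reproduced directly from the estimated coefficients and the extreme values of $p$ and $y$ in Table 4.3. The sketch below assumes that the logarithms are natural logarithms, which is consistent with the quoted range of $\log p$ ($\ln 76.3 \approx 4.33$); the function names are illustrative choices.

```python
import math

# Coefficients of the combined-period equation with the interaction term.
b1, b2, b3 = -0.996, -0.718, 0.211

def income_elasticity(p):
    """d log q / d log y = b2 + b3 * log p."""
    return b2 + b3 * math.log(p)

def price_elasticity(y):
    """d log q / d log p = b1 + b3 * log y."""
    return b1 + b3 * math.log(y)

# Smallest and largest values of p and y in Table 4.3.
p_min, p_max = 76.3, 105.4
y_min, y_max = 43.8, 105.5

income_range = (income_elasticity(p_min), income_elasticity(p_max))
price_range = (price_elasticity(y_min), price_elasticity(y_max))

print([round(v, 3) for v in income_range])
print([round(v, 3) for v in price_range])
```

The results agree with the ranges in the text up to rounding of the logarithms. Both elasticities move with the other variable only through the interaction coefficient $\hat\beta_3$.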


4.7 Prediction in the Multiple 
Regression Model 

The formulas for prediction in multiple regression are similar to those in the
case of simple regression except that to compute the standard error of the
predicted value we need the variances and covariances of all the regression
coefficients. Again we will present the expression for the standard error in the case
of two explanatory variables and then the expression for the general case of $k$
explanatory variables. But we do not need to compute this general expression
because there is an easier way of generating the standard error, which we will
describe in Chapter 8 (see Section 8.5).

Let the estimated regression equation be

$$\hat y = \hat\alpha + \hat\beta_1 x_1 + \hat\beta_2 x_2$$

Now consider the prediction of the value $y_0$ of $y$ given values $x_{10}$ of $x_1$ and $x_{20}$
of $x_2$, respectively. These could be values at some future date.

Then we have

$$y_0 = \alpha + \beta_1 x_{10} + \beta_2 x_{20} + u_0$$

Consider

$$\hat y_0 = \hat\alpha + \hat\beta_1 x_{10} + \hat\beta_2 x_{20}$$

The prediction error is

$$\hat y_0 - y_0 = \hat\alpha - \alpha + (\hat\beta_1 - \beta_1)x_{10} + (\hat\beta_2 - \beta_2)x_{20} - u_0$$

Since $E(\hat\alpha - \alpha)$, $E(\hat\beta_1 - \beta_1)$, $E(\hat\beta_2 - \beta_2)$, and $E(u_0)$ are all equal to zero, we
have $E(\hat y_0 - y_0) = 0$. Thus the predictor $\hat y_0$ is unbiased. Note that what we are
saying is $E(\hat y_0) = E(y_0)$ (since both $\hat y_0$ and $y_0$ are random variables). The
variance of the prediction error is

$$\sigma^2\left(1 + \frac{1}{n}\right) + (x_{10} - \bar x_1)^2\,\mathrm{var}(\hat\beta_1) + 2(x_{10} - \bar x_1)(x_{20} - \bar x_2)\,\mathrm{cov}(\hat\beta_1, \hat\beta_2) + (x_{20} - \bar x_2)^2\,\mathrm{var}(\hat\beta_2)$$

In the case of $k$ explanatory variables, this is

$$\sigma^2\left(1 + \frac{1}{n}\right) + \sum_i \sum_j (x_{i0} - \bar x_i)(x_{j0} - \bar x_j)\,\mathrm{cov}(\hat\beta_i, \hat\beta_j)$$

We estimate $\sigma^2$ by RSS/$(n - 3)$ in the case of two explanatory variables and
by RSS/$(n - k - 1)$ in the general case.

Illustrative Example 

Again consider the illustrative example in Section 4.3. The regression is

$$\hat y = 4.0 + 0.7x_1 + 0.2x_2$$

Consider the prediction of $y$ for $x_{10} = 12$ and $x_{20} = 7$. We have

$$\hat y_0 = 4.0 + 0.7(12) + 0.2(7) = 13.8$$

Note that

$$x_{10} - \bar x_1 = 12 - 10 = 2 \qquad x_{20} - \bar x_2 = 7 - 5 = 2$$

Using the expressions for $\mathrm{var}(\hat\beta_1)$, $\mathrm{var}(\hat\beta_2)$, $\mathrm{cov}(\hat\beta_1, \hat\beta_2)$, and $\hat\sigma^2$ derived in
Section 4.3, we get the estimated variance of the prediction error as

$$0.07\left(1 + \frac{1}{23}\right) + 4\left(\frac{3}{20} + \frac{3}{20} - \frac{2}{10}\right)(0.07) = 0.101$$

The standard error of the prediction is 0.318. Thus the 95% confidence interval
for the prediction is

13.8 ± 2.086(0.318) or 13.8 ± 0.66 or (13.14, 14.46)

Comment

In the case of simple regression we said (in Section 3.7) that the variance of the
prediction error increases as we increase the distance of the point $x_0$ from $\bar x$. In
the case of multiple regression we cannot say that the variance of the prediction
error increases with the Euclidean distance $[(x_{10} - \bar x_1)^2 + (x_{20} - \bar x_2)^2]^{1/2}$. This is
because there is the covariance term as well. For instance, in our example let
us change $x_{20}$ to 3. Now $x_{20} - \bar x_2 = -2$. The Euclidean distance is the same as
before. It is $\sqrt{2^2 + (-2)^2}$. However, the variance of the prediction error is now

$$0.07\left(1 + \frac{1}{23}\right) + 4\left(\frac{3}{20} + \frac{3}{20} + \frac{2}{10}\right)(0.07) = 0.213$$

which is higher because of the last term turning positive. If $x_1$ and $x_2$ are highly
correlated, we will observe wide discrepancies in the variances of the prediction
error for the same Euclidean distance of the value of $x_0$ from the sample
mean. Thus the simple relationship we found in the case of simple regression
does not hold in multiple regression.
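The two variances in this comparison can be recomputed from the sums of squares of the illustrative example ($S_{11} = S_{22} = 12$, $S_{12} = 8$, $\hat\sigma^2 = 0.07$, $n = 23$). The sketch below is only a numerical check; the function name and argument layout are illustrative choices, and the variances and covariance of the coefficients are written as $\hat\sigma^2$ times $C_{11}$, $C_{12}$, $C_{22}$.

```python
def prediction_variance(d1, d2, sigma2, n, c11, c12, c22):
    """Estimated variance of the prediction error with two regressors.

    d1, d2 are the deviations of the new point from the sample means;
    sigma2 * c11, sigma2 * c22 and sigma2 * c12 are the variances and the
    covariance of the two slope estimates.
    """
    quad = d1 * d1 * c11 + 2 * d1 * d2 * c12 + d2 * d2 * c22
    return sigma2 * (1 + 1 / n) + sigma2 * quad

S11, S22, S12 = 12.0, 12.0, 8.0
delta = S11 * S22 - S12 ** 2
c11, c22, c12 = S22 / delta, S11 / delta, -S12 / delta   # 3/20, 3/20, -1/10
sigma2, n = 0.07, 23

# Same Euclidean distance from the means, different directions.
v_same_sign = prediction_variance(2, 2, sigma2, n, c11, c12, c22)
v_opposite = prediction_variance(2, -2, sigma2, n, c11, c12, c22)

print(round(v_same_sign, 3), round(v_same_sign ** 0.5, 3), round(v_opposite, 3))
```

Because the two regressors are positively correlated in the sample, a new point that moves both in the same direction is predicted more precisely than one that moves them in opposite directions.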


4.8 Analysis of Variance and 
Tests of Hypotheses 


In Section 4.3, result 6, we discussed an F-test to test hypotheses about $\beta_1$ and
$\beta_2$. An alternative expression for this test is defined by the statistic

$$F = \frac{(\text{RRSS} - \text{URSS})/r}{\text{URSS}/(n - k - 1)} \tag{4.13}$$

where URSS = unrestricted residual sum of squares
      RRSS = restricted residual sum of squares obtained by imposing the
             restrictions of the hypothesis
      r = number of restrictions imposed by the hypothesis

The derivation of this test is given in the appendix to this chapter. A proof
in the general case can be found in graduate textbooks in this area.⁹

As an illustration, consider the hypothesis $\beta_1 = 1$, $\beta_2 = 0$ in the illustrative
example in Section 4.3. The unrestricted residual sum of squares is URSS =
1.4. To get the restricted residual sum of squares RRSS, we have to minimize

$$\sum (y_i - \alpha - 1.0x_{1i} - 0.0x_{2i})^2$$

Since both $\beta_1$ and $\beta_2$ are specified, there is only $\alpha$ to be estimated and we get
$\hat\alpha = \bar y - 1.0\bar x_1$. Thus

$$\text{RRSS} = \sum [y_i - \bar y - (x_{1i} - \bar x_1)]^2 = S_{yy} + S_{11} - 2S_{1y} = 10.0 + 12.0 - 2(10.0) = 2.0$$

Also, we have $r = 2$, $n = 23$, $k = 2$. Hence

$$F = \frac{(2.0 - 1.4)/2}{1.4/20} = \frac{0.3}{0.07} = 4.3$$

which is exactly the value we obtained in Section 4.3.
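The computation in (4.13) is short enough to write as a function. The following sketch, with an illustrative function name, reproduces the F value of this example from the sums of squares quoted above.

```python
def f_statistic(rrss, urss, r, n, k):
    """F statistic (4.13): r restrictions, n observations, k regressors."""
    return ((rrss - urss) / r) / (urss / (n - k - 1))

# Illustrative example: hypothesis beta1 = 1, beta2 = 0.
Syy, S11, S1y = 10.0, 12.0, 10.0
urss = 1.4
rrss = Syy + S11 - 2 * S1y          # restricted RSS, equal to 2.0

F = f_statistic(rrss, urss, r=2, n=23, k=2)
print(round(F, 1))
```

The statistic is then compared with the tabulated F value with $r$ and $n - k - 1$ degrees of freedom.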
In the special case where the hypothesis is

$$\beta_1 = \beta_2 = \cdots = \beta_k = 0$$


9 See, for instance, Maddala, Econometrics, pp. 457-460.

we have URSS = $S_{yy}(1 - R^2)$ and RRSS = $S_{yy}$. Hence the test is given by

$$F = \frac{[S_{yy} - S_{yy}(1 - R^2)]/k}{S_{yy}(1 - R^2)/(n - k - 1)} = \frac{R^2}{1 - R^2}\cdot\frac{n - k - 1}{k} \tag{4.14}$$

which has an F-distribution with degrees of freedom $k$ and $(n - k - 1)$. What
this test does is test the hypothesis that none of the $x$'s influence $y$; that is, the
regression equation is useless. Of course, a rejection of this hypothesis leaves
us with the question: Which of the $x$'s are useful in explaining $y$?

It is customary to present this test in the form of an analysis of variance
table similar to Table 3.3 we considered for simple regression. This is shown in
Table 4.5. What we do is decompose the variance of $y$ into two components:
that due to the explanatory variables (i.e., due to regression) and that which is
unexplained (i.e., residual).

As an illustration, consider the hospital cost regression in Section 4.6. The
analysis of variance table is given in Table 4.6. This F value is highly significant.
The 1% point from the F-tables with 8 and 168 d.f. is 2.51 and the observed
F is much higher.

Of course we reject the hypothesis that $\beta_1 = \beta_2 = \cdots = \beta_k = 0$. All this
means is that the case-mix variables are important in explaining the variation
in average cost per case between the hospitals. But it does not say which
variables are important.
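The equivalence between the $R^2$ form (4.14) and the ratio of mean squares can be checked with the sums of squares reported for the hospital cost regression in Table 4.6 (regression SS 10,357, residual SS 23,311, 177 hospitals, 8 regressors). This is a minimal sketch; the function name is an illustrative choice.

```python
def f_from_r2(r2, n, k):
    """F statistic (4.14) for the hypothesis that all k slopes are zero."""
    return (r2 / (1 - r2)) * ((n - k - 1) / k)

ss_reg, ss_res = 10357.0, 23311.0
n, k = 177, 8

r2 = ss_reg / (ss_reg + ss_res)
F_r2 = f_from_r2(r2, n, k)                          # from R^2
F_ms = (ss_reg / k) / (ss_res / (n - k - 1))        # ratio of mean squares

print(round(r2, 3), round(F_r2, 2), round(F_ms, 2))
```

Both routes give the same value because $R^2 S_{yy}$ and $(1 - R^2)S_{yy}$ are exactly the regression and residual sums of squares.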


Table 4.5 Analysis of Variance for the Multiple Regression Model

Source of     Sum of               Degrees of         Mean Square,
Variation     Squares, SS          Freedom, d.f.      SS/d.f.                                   F

Regression    $R^2 S_{yy}$         $k$                $R^2 S_{yy}/k = MS_1$                     $F = MS_1/MS_2$
Residual      $(1 - R^2)S_{yy}$    $n - k - 1$        $(1 - R^2)S_{yy}/(n - k - 1) = MS_2$
Total         $S_{yy}$             $n - 1$



Table 4.6 Analysis of Variance for the Hospital Cost Regression in Section 4.6

Source of     Sum of          Degrees of         Mean Square,
Variation     Squares, SS     Freedom, d.f.      SS/d.f.          F

Regression    10,357            8                1294.625         1294.625/138.756 = 9.33
Residual      23,311          168                 138.756
Total         33,668          176

Nested and Nonnested Hypotheses

The hypotheses we are interested in testing can usually be classified under two
categories: nested and nonnested. Consider, for instance, the regression
models

$$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u \qquad \text{model 1}$$

$$y = \beta_1 x_1 + \beta_2 x_2 + u \qquad \text{model 2}$$

A test of the hypothesis $H_0: \beta_3 = 0$ versus $H_1: \beta_3 \neq 0$ is a test of the hypothesis
that the data are generated by model 2 versus that the data are generated by
model 1. Such a hypothesis is called a nested hypothesis because the parameters
in model 2 form a subset of the parameters in model 1. A hypothesis

$$H_0: \beta_1 + \beta_2 + \beta_3 = 0 \quad\text{versus}\quad H_1: \beta_1 + \beta_2 + \beta_3 \neq 0$$

can also be called a nested hypothesis because we can reparametrize the original
equation as

$$y = (\beta_1 + \beta_2 + \beta_3)x_1 + \beta_2(x_2 - x_1) + \beta_3(x_3 - x_1) + u = \gamma x_1 + \beta_2(x_2 - x_1) + \beta_3(x_3 - x_1) + u$$

where $\gamma = \beta_1 + \beta_2 + \beta_3$. Now consider the parameter set as $(\gamma, \beta_2, \beta_3)$ and
$H_0: \gamma = 0$ versus $H_1: \gamma \neq 0$.

Similarly, if we have the hypothesis

$$H_0: \beta_1 + \beta_2 + \beta_3 = 0, \quad \beta_2 - \beta_3 = 0$$

we reparametrize the original equation by defining $\gamma_1 = \beta_1 + \beta_2 + \beta_3$, $\gamma_2 =
\beta_2 - \beta_3$, $\gamma_3 = \beta_3$, so that $\beta_3 = \gamma_3$, $\beta_2 = \gamma_2 + \gamma_3$, $\beta_1 = \gamma_1 - \gamma_2 - 2\gamma_3$, and the
original model becomes

$$y = (\gamma_1 - \gamma_2 - 2\gamma_3)x_1 + (\gamma_2 + \gamma_3)x_2 + \gamma_3 x_3 + u = \gamma_1 x_1 + \gamma_2(x_2 - x_1) + \gamma_3(x_3 + x_2 - 2x_1) + u$$

Now we consider the parameter set $(\gamma_1, \gamma_2, \gamma_3)$ and $H_0$ specifies the values of
$\gamma_1$ and $\gamma_2$.

Suppose, on the other hand, that we consider the two regression models:

$$y = \beta_1 x_1 + u_1 \qquad \text{model 3}$$

$$y = \beta_2 x_2 + u_2 \qquad \text{model 4}$$

$H_0$: the data are generated by model 3

$H_1$: the data are generated by model 4

Now the parameter set specified by model 3 is not a subset of the parameter
set specified by model 4. Hence we say that hypotheses $H_0$ and $H_1$ are
nonnested.

In the following sections we consider nested hypotheses only. The problem
of selection of regressors, for instance, whether to include $x_1$ only or $x_2$ only, is
a problem in testing nonnested hypotheses. This problem is treated briefly in
Chapter 12.

Tests for Linear Functions of Parameters 

We have until now discussed tests for parameters. Very often we need tests for
functions of parameters. The functions we need to consider can be linear
functions or nonlinear functions. We discuss linear functions first.

Suppose that we have estimated a production function in a log-linear form:

$$\log X = \alpha + \beta_1 \log L + \beta_2 \log K + u$$

where $X$ is the output, $L$ the labor input, and $K$ the capital input. Then a test
for constant returns to scale is a test of the hypothesis $\beta_1 + \beta_2 = 1$. We can
use a t-test to test this hypothesis as follows.

We get the least squares estimates $\hat\beta_1$ and $\hat\beta_2$ and define

$$\sigma^2 C_{ij} = \mathrm{cov}(\hat\beta_i, \hat\beta_j) \qquad i, j = 1, 2$$

Then, under the null hypothesis $\beta_1 + \beta_2 - 1 = 0$, we have the result that
$\hat\beta_1 + \hat\beta_2 - 1$ is normally distributed with mean 0 and variance $\sigma^2(C_{11} + 2C_{12} +
C_{22})$.

Since RSS/$\sigma^2$ has an independent $\chi^2$ distribution with degrees of freedom
$(n - 3)$, the t-statistic to test the hypothesis is given by

$$t = \frac{\hat\beta_1 + \hat\beta_2 - 1}{\sqrt{C_{11} + 2C_{12} + C_{22}}} \Bigg/ \sqrt{\frac{\text{RSS}}{n - 3}}$$

which has a t-distribution with degrees of freedom $n - 3$.

An alternative procedure is to derive URSS and RRSS and use the F-test.
URSS is the residual sum of squares we obtain when we estimate $\alpha$, $\beta_1$, and
$\beta_2$. To get RRSS, we have to use the restriction $\beta_1 + \beta_2 - 1 = 0$. The way to
do this is to eliminate $\beta_2$, since $\beta_2 = 1 - \beta_1$. Thus we get the regression equation
as

$$\log X = \alpha + \beta_1 \log L + (1 - \beta_1) \log K + u$$

or

$$(\log X - \log K) = \alpha + \beta_1(\log L - \log K) + u$$

Thus to get RRSS we regress $(\log X - \log K)$ on $(\log L - \log K)$. The residual
sum of squares from this equation gives us the required RRSS.

The same procedure applies if we have two linear restrictions. Consider, for
instance, the case of the multiple regression equation

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$

and we need to test the restrictions

$$\beta_1 + \beta_2 + \beta_3 = 1 \qquad \beta_2 - 2\beta_3 = 0$$

Note that these restrictions can be written as

$$\beta_2 = 2\beta_3 \quad\text{and}\quad \beta_1 = 1 - 3\beta_3$$

Substituting these in the original equation, we get

$$y = \alpha + (1 - 3\beta_3)x_1 + 2\beta_3 x_2 + \beta_3 x_3 + u$$

or

$$(y - x_1) = \alpha + \beta_3(-3x_1 + 2x_2 + x_3) + u$$

Thus we get the restricted residual sum of squares RRSS by running a regression
of $(y - x_1)$ on $(-3x_1 + 2x_2 + x_3)$ with a constant term.


Illustrative Example 

Consider again the illustrative example in Section 4.3. Suppose that the problem
is to test the hypothesis

$$\frac{\beta_1}{\beta_2} = \frac{5}{3} \quad\text{against}\quad \frac{\beta_1}{\beta_2} \neq \frac{5}{3}$$

at the 5% significance level. The hypothesis is $\beta_2 = 0.6\beta_1$. To compute the
restricted estimates, we substitute $\beta_2 = 0.6\beta_1$ and estimate the equation

$$y = \alpha + \beta_1 x + u$$

where

$$x = x_1 + 0.6x_2$$

We now have

$$S_{xx} = S_{11} + (0.6)^2 S_{22} + 2(0.6)S_{12} = 12 + (0.6)^2(12) + 2(0.6)(8) = 25.92$$

$$S_{xy} = S_{1y} + 0.6S_{2y} = 10 + 0.6(8) = 14.8$$

Hence

$$\hat\beta_1 = \frac{14.8}{25.92} = 0.57$$

$$\text{RRSS} = S_{yy} - \hat\beta_1 S_{xy} = 10 - 8.45 = 1.55$$

Thus RRSS − URSS = 1.55 − 1.40 = 0.15 and the F-statistic is

$$F = \frac{0.15}{1.4/20} = \frac{0.15}{0.07} = 2.14$$

which, with d.f. 1 and 20, is not significant at the 5% level.

By the alternative method, we consider $\hat\beta_2 - 0.6\hat\beta_1$. It is normally distributed
with mean zero (under $H_0$) and using the expression for $C_{ij}$ we note that its
variance is

$$\sigma^2\left[\frac{12}{80} + (0.6)^2\,\frac{12}{80} - 2(0.6)\left(-\frac{8}{80}\right)\right] = 0.324\sigma^2$$

Since

$$\hat\beta_2 - 0.6\hat\beta_1 = 0.2 - 0.6(0.7) = -0.22$$

we have the F-statistic

$$F = \frac{(-0.22)^2/0.324}{1.4/20} = \frac{0.15}{0.07}$$

which is the same as the one obtained earlier.
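Both routes to the F-statistic can be reproduced from the sums of squares of the illustrative example. The sketch below is a numerical check only; the variable names are illustrative choices, and `c` stands for the hypothesized ratio in $\beta_2 = 0.6\beta_1$.

```python
S11, S22, S12 = 12.0, 12.0, 8.0
S1y, S2y, Syy = 10.0, 8.0, 10.0
n, urss = 23, 1.4
c = 0.6                                   # hypothesis: beta2 = c * beta1

# Method 1: restricted regression of y on x = x1 + c * x2.
Sxx = S11 + c ** 2 * S22 + 2 * c * S12
Sxy = S1y + c * S2y
b1_restricted = Sxy / Sxx
rrss = Syy - b1_restricted * Sxy
F_rss = ((rrss - urss) / 1) / (urss / (n - 3))

# Method 2: variance of (b2 - c * b1) from the unrestricted regression.
delta = S11 * S22 - S12 ** 2
C11, C22, C12 = S22 / delta, S11 / delta, -S12 / delta
b1 = (S22 * S1y - S12 * S2y) / delta      # 0.7
b2 = (S11 * S2y - S12 * S1y) / delta      # 0.2
var_factor = C22 + c ** 2 * C11 - 2 * c * C12
F_var = ((b2 - c * b1) ** 2 / var_factor) / (urss / (n - 3))

print(round(rrss, 2), round(var_factor, 3), round(F_rss, 2), round(F_var, 2))
```

The two statistics coincide, as they must for a single linear restriction; the F value is the square of the corresponding t-ratio.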


4.9 Omission of Relevant Variables and 
Inclusion of Irrelevant Variables 


Until now we have assumed that the multiple regression equation we are estimating
includes all the relevant explanatory variables. In practice, this is rarely
the case. Sometimes some relevant variables are not included due to oversight
or lack of measurements. At other times some irrelevant variables are included.
What we would like to know is how our inferences change when these problems
are present.


Omission of Relevant Variables 


Let us first consider the omission of relevant variables. Suppose that the
true equation is

$$y = \beta_1 x_1 + \beta_2 x_2 + u \tag{4.15}$$

Instead, we omit $x_2$ and estimate the equation

$$y = \beta_1 x_1 + u$$

This will be referred to as the “misspecified model.” The estimate of $\beta_1$ we get
is

$$\hat\beta_1 = \frac{\sum x_1 y}{\sum x_1^2}$$

Substituting the expression for $y$ from (4.15) in this, we get

$$\hat\beta_1 = \frac{\sum x_1(\beta_1 x_1 + \beta_2 x_2 + u)}{\sum x_1^2} = \beta_1 + \beta_2\,\frac{\sum x_1 x_2}{\sum x_1^2} + \frac{\sum x_1 u}{\sum x_1^2}$$

Since $E(\sum x_1 u) = 0$ we get

$$E(\hat\beta_1) = \beta_1 + b_{21}\beta_2 \tag{4.16}$$

where $b_{21} = \sum x_1 x_2 / \sum x_1^2$ is the regression coefficient from a regression of $x_2$
on $x_1$.

Thus $\hat\beta_1$ is a biased estimator for $\beta_1$ and the bias is given by

bias = (coefficient of the excluded variable)
       × (regression coefficient in a regression of the
          excluded variable on the included variable)
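The bias formula has an exact in-sample counterpart: the slope from the regression that omits $x_2$ equals the full-regression slope of $x_1$ plus $b_{21}$ times the full-regression slope of $x_2$. The sketch below checks this for a small hypothetical data set (not from the text), using regressions without a constant term as in (4.15).

```python
def raw(u, v):
    """Sum of cross products without mean correction (no constant term)."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data, chosen only for illustration.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 7.0, 5.0]
y = [4.1, 3.9, 9.2, 9.0, 15.8, 14.1]

S11, S22, S12 = raw(x1, x1), raw(x2, x2), raw(x1, x2)
S1y, S2y = raw(x1, y), raw(x2, y)
delta = S11 * S22 - S12 ** 2

# Full regression of y on x1 and x2.
b1_full = (S22 * S1y - S12 * S2y) / delta
b2_full = (S11 * S2y - S12 * S1y) / delta

# Misspecified regression of y on x1 only, and auxiliary regression of x2 on x1.
b1_short = S1y / S11
b21 = S12 / S11

print(b1_short, b1_full + b21 * b2_full)
```

The two printed numbers agree up to rounding error. Taking expectations of this identity gives (4.16), so the sign of the bias is the sign of $b_{21}\beta_2$.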

If we denote the estimator for $\beta_1$ from (4.15) by $\tilde\beta_1$, the variance of $\tilde\beta_1$ is given
by

$$\mathrm{var}(\tilde\beta_1) = \frac{\sigma^2}{S_{11}(1 - r_{12}^2)}$$

where $S_{11} = \sum x_1^2$ and $r_{12}$ is the correlation between $x_1$ and $x_2$. On the other hand,

$$\mathrm{var}(\hat\beta_1) = \frac{\sigma^2}{S_{11}}$$

Thus $\hat\beta_1$ is a biased estimator but has a smaller variance than $\tilde\beta_1$. In fact, the
variance would be considerably smaller if $r_{12}^2$ is high. However, the estimated
standard error need not be smaller for $\hat\beta_1$ than for $\tilde\beta_1$. This is because $\hat\sigma^2$, the
estimated variance of the error, can be higher in the misspecified model. It is
given by the residual sum of squares divided by degrees of freedom, and this
can be higher (or lower) for the misspecified model.

Let us denote the estimated variance by $S^2$. Then the formula connecting the
estimated variances is

$$\frac{S^2(\hat\beta_1)}{S^2(\tilde\beta_1)} = \frac{1 - r_{12}^2}{1 - r^2_{y2\cdot 1}}$$

Thus the standard error of $\hat\beta_1$ will be less than the standard error of $\tilde\beta_1$ only if
$r_{12}^2 > r^2_{y2\cdot 1}$.

We have considered the case of only one included and one omitted variable.
In the case where we have $k - 1$ included variables and the $k$th variable omitted,
formula (4.16) generalizes to

$$E(\hat\beta_i) = \beta_i + b_{ki}\beta_k \qquad i = 1, 2, \ldots, k - 1 \tag{4.17}$$

where $b_{ki}$ is the regression coefficient of $x_i$ in the auxiliary regression of $x_k$ on
$x_1, x_2, \ldots, x_{k-1}$. That is, we consider the regression of the omitted variable $x_k$
on all the included variables.

In the general case¹⁰ where we have several included variables and several


10 We will not go through the derivation here because the use of matrix notation is unavoidable.
Proofs can be found in many textbooks in econometrics. See, for instance, Maddala, Econometrics,
p. 461.

omitted variables, we have to estimate the “auxiliary” multiple regressions of
each of the excluded variables on all the included variables. The bias in each
of the estimated coefficients of the included variables will be a weighted sum
of the coefficients of all the excluded variables with weights obtained from
these auxiliary multiple regressions.

Suppose that we have $k$ explanatory variables, of which the first $k_1$ are included
and the remaining $(k - k_1)$ are omitted. Then the formula corresponding
to (4.16) and (4.17) is

$$E(\hat\beta_i) = \beta_i + \sum_{j = k_1 + 1}^{k} b_{ji}\beta_j \qquad i = 1, 2, \ldots, k_1 \tag{4.18}$$

where $b_{ji}$ is the regression coefficient of the $i$th included variable in a regression
of the $j$th omitted variable on all the included variables. Note that we pick the
coefficients of the $i$th included variable from the $(k - k_1)$ auxiliary multiple
regressions.

The formulas (4.16)-(4.18) can be used to get some rough estimates of the
direction of the biases in estimated coefficients when some variables are omitted
because of lack of observations or because they are not measurable. We
will present two examples, one in which the omitted variable is actually measured
and the other in which it is not.


Example 1: Demand for Food in the United States 

Consider the estimation of the demand for food in the United States based on 
the data in Table 4.9. Suppose that the “true” equation is 

Qd ~ a + Pif 3 /) + P2^ + u 
However, we omit the income variable. We get 

q d = 89.97 + 0. 107P d <t 2 = 2.338 

( 1185 ) (0 118 ) 

(Figures in parentheses are standard errors.) 

The coefficient of $P_D$ has the wrong sign. Can this be attributed to the
omission of the income variable? The answer is yes, because the coefficient of
$P_D$ is a biased estimate with the bias given by

bias = (coefficient of the income variable)
       x (the regression coefficient of income on price)

The coefficient of income is expected to be positive. Also, given that the data
are time-series data, we would expect a positive correlation between $P_D$ and
$Y$. Hence the bias is expected to be positive, and this could turn a negative
coefficient into a positive one.

In this case the regression equation with $Y$ included gives the result

$$\hat{Q}_D = \underset{(5.84)}{92.05} - \underset{(0.067)}{0.142}\,P_D + \underset{(0.031)}{0.236}\,Y \qquad \hat{\sigma}^2 = 1.952$$

(Figures in parentheses are standard errors.) Note that the coefficient of
$P_D$ is now negative. Also, note that the standard error of the coefficient of
$P_D$ is higher in the misspecified model than in the “true” model. This is so
despite the fact that the variance of $\hat{\beta}_1$ is expected to be smaller
in the misspecified
model. (Check the relationship between r\ 2 and rj 2 , with the data.) 

Example 2: Production Functions and Management Bias

In the estimation of production functions we have to omit the quality of inputs
and managerial inputs because of lack of measurement. Consider the estimation
of the production function

$$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$

where

$y$ = log of output
$x_1$ = log of labor input
$x_2$ = log of capital input
$x_3$ = log of managerial input

Now $x_3$ is not observable. What will be the effect of this on estimates of
$\beta_1$ and $\beta_2$? From the formula (4.17) we have

$$E(\hat{\beta}_1) = \beta_1 + b_{31}\beta_3$$

$$E(\hat{\beta}_2) = \beta_2 + b_{32}\beta_3$$

where $b_{31}$ and $b_{32}$ are the regression coefficients in the regression of
$x_3$ on $x_1$ and $x_2$. Now $\beta_1 + \beta_2$ is often referred to as
“returns to scale.” Let us denote this by $S$. The estimated returns to scale,
$\hat{S}$, is $\hat{\beta}_1 + \hat{\beta}_2$. Thus

$$E(\hat{S}) = S + \beta_3(b_{31} + b_{32})$$

Since $\beta_3$ is expected to be positive, the bias in the estimation of
returns to scale will depend on the sign of $b_{31} + b_{32}$. If we assume that
managerial input does not increase proportionately with measured inputs of
labor and capital, we would expect $b_{31} + b_{32}$ to be negative and thus
there is a downward bias in the estimates of returns to scale.

Inclusion of Irrelevant Variables 

Consider now the case of inclusion of irrelevant variables. Suppose that the
true equation is

$$y = \beta_1 x_1 + u$$

but we estimate the equation

$$y = \beta_1 x_1 + \beta_2 x_2 + v$$

The least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ from this
misspecified equation are given by

$$\hat{\beta}_1 = \frac{S_{22}S_{1y} - S_{12}S_{2y}}{S_{11}S_{22} - S_{12}^2} \qquad \hat{\beta}_2 = \frac{S_{11}S_{2y} - S_{12}S_{1y}}{S_{11}S_{22} - S_{12}^2}$$

where $S_{11} = \sum x_1^2$, $S_{1y} = \sum x_1 y$, $S_{12} = \sum x_1 x_2$, and so
on. Since $y = \beta_1 x_1 + u$ we have $E(S_{2y}) = \beta_1 S_{12}$ and
$E(S_{1y}) = \beta_1 S_{11}$. Hence we get

$$E(\hat{\beta}_1) = \beta_1 \quad \text{and} \quad E(\hat{\beta}_2) = 0$$

Thus we get unbiased estimates for both the parameters. This result, coupled
with the earlier results regarding the bias introduced by the omission of
relevant variables, might lead us to believe that it is better to include
variables (when in doubt) rather than exclude them. However, this is not so,
because though the inclusion of irrelevant variables has no effect on the bias
of the estimators, it does affect the variances.

The variance of $\tilde{\beta}_1$, the estimator of $\beta_1$ from the correct
equation, is given by

$$\mathrm{var}(\tilde{\beta}_1) = \frac{\sigma^2}{S_{11}}$$

On the other hand, from the misspecified equation we have

$$\mathrm{var}(\hat{\beta}_1) = \frac{\sigma^2}{(1 - r_{12}^2)S_{11}}$$

where $r_{12}$ is the correlation between $x_1$ and $x_2$. Thus
$\mathrm{var}(\hat{\beta}_1) > \mathrm{var}(\tilde{\beta}_1)$ unless
$r_{12} = 0$. Hence we will be getting unbiased but inefficient estimates by
including the irrelevant variable. It can be shown that the estimator for the
residual variance we use is an unbiased estimator of $\sigma^2$. Thus there is
no further bias arising from the use of estimated variance from the
misspecified equation. 11
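The loss of efficiency can be illustrated with a small calculation. The sketch below uses simulated regressors (an assumption of the example, as is the error variance of 1) and compares the variance of the estimator of $\beta_1$ in the correct equation, $\sigma^2/S_{11}$, with the corresponding diagonal element of $\sigma^2(X'X)^{-1}$ when the irrelevant variable is included. The ratio of the two should equal $1/(1 - r_{12}^2)$.

```python
import random

random.seed(1)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
# x2 is irrelevant for y but correlated with x1 (hypothetical design).
x2 = [0.8 * a + random.gauss(0, 1) for a in x1]

# Work with deviations from means, as in the text.
m1, m2 = sum(x1) / n, sum(x2) / n
d1 = [a - m1 for a in x1]
d2 = [b - m2 for b in x2]
S11 = sum(a * a for a in d1)
S22 = sum(b * b for b in d2)
S12 = sum(a * b for a, b in zip(d1, d2))
r12_sq = S12 ** 2 / (S11 * S22)

sigma2 = 1.0  # assumed error variance
var_correct = sigma2 / S11
# First diagonal element of sigma^2 (X'X)^{-1} for the two-regressor design.
var_misspecified = sigma2 * S22 / (S11 * S22 - S12 ** 2)
inflation = var_misspecified / var_correct

print(round(r12_sq, 3), round(inflation, 3), round(1 / (1 - r12_sq), 3))
```

The last two printed numbers coincide. The higher the correlation between the relevant and the irrelevant regressor, the larger the price paid in variance.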


4.10 Degrees of Freedom and $\bar{R}^2$


If we have $n$ observations and estimate three regression parameters as in
equation (4.1), we can see from the normal equations (4.2)-(4.4) that the
estimated residuals $\hat{u}_i$ satisfy three linear restrictions:

$$\sum \hat{u}_i = 0 \qquad \sum x_{1i}\hat{u}_i = 0 \qquad \sum x_{2i}\hat{u}_i = 0 \qquad (4.19)$$

or, in essence, there are only $(n - 3)$ residuals to vary because, given any
$(n - 3)$ residuals, the remaining three residuals can be obtained by solving
equations (4.19). This point we express by saying that there are $(n - 3)$
degrees of freedom.


11 See Maddala, Econometrics, p. 157.





4 MULTIPLE REGRESSION 


As we saw earlier, the estimate of the residual variance $\sigma^2$ is given by

$$\hat{\sigma}^2 = \frac{\text{RSS}}{\text{degrees of freedom}}$$

As we increase the number of explanatory variables RSS declines but there
is a decrease in the degrees of freedom as well. What happens to
$\hat{\sigma}^2$ depends on the proportionate decrease in the numerator and the
denominator. Thus there will be a point when $\hat{\sigma}^2$ will actually
start increasing as we add more explanatory variables. It is often suggested
that we should choose the set of variables for which $\hat{\sigma}^2$ is the
minimum. We discuss the rationale behind this procedure in Chapter 12. 12

This is also the reason why, in multiple regression problems, it is customary
to report what is known as adjusted $R^2$, denoted by $\bar{R}^2$. The measure
$R^2$ defined earlier keeps on increasing (until it reaches 1.0) as we add
extra explanatory variables and thus does not take account of the
degrees-of-freedom problem. $\bar{R}^2$ is simply $R^2$ adjusted for degrees of
freedom. It is defined by the relation

$$1 - \bar{R}^2 = \frac{n - 1}{n - k - 1}(1 - R^2) \qquad (4.20)$$

where $k$ is the number of regressors. We subtract $(k + 1)$ from $n$ because we
estimate a constant term in addition to the coefficients of these $k$
regressors. We can write (4.20) as

$$\frac{(1 - \bar{R}^2)S_{yy}}{n - 1} = \frac{(1 - R^2)S_{yy}}{n - k - 1} = \hat{\sigma}^2 \qquad (4.21)$$

Since $S_{yy}$ and $n$ are constant, as we increase the number of regressors
included in the equation, $\hat{\sigma}^2$ and $(1 - \bar{R}^2)$ move in the
same direction, and $\hat{\sigma}^2$ and $\bar{R}^2$ move in the opposite
direction. Thus the set of variables that gives minimum $\hat{\sigma}^2$ is
also the set that maximizes $\bar{R}^2$.

Also, from equation (4.20) we can easily see that if $R^2 < k/(n - 1)$, then
$1 - R^2 > (n - k - 1)/(n - 1)$ and hence $1 - \bar{R}^2 > 1$. Thus
$\bar{R}^2$ is negative! For example, with 2 explanatory variables and 21
observations, if $R^2 < 0.1$, $\bar{R}^2$ will be negative.
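The definition (4.20) and the threshold $k/(n - 1)$ are easy to verify by direct computation. The function name `adjusted_r2` below is chosen for this illustration only.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared from (4.20); k regressors plus a constant term."""
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)

# Threshold k/(n - 1) for the example in the text: k = 2, n = 21.
threshold = 2 / (21 - 1)
print(threshold)                  # 0.1
print(adjusted_r2(0.05, 21, 2))   # R^2 below the threshold: negative value
print(adjusted_r2(0.50, 21, 2))   # R^2 above the threshold: positive value
```

At $R^2$ exactly equal to $k/(n - 1)$ the adjusted measure is zero, so any smaller $R^2$ gives a negative $\bar{R}^2$.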

There is a relationship between the $t$-tests and $F$-tests outlined earlier
and $\bar{R}^2$. If the $t$-ratio for the coefficient of any variable is less
than 1, then dropping that variable will increase $\bar{R}^2$. More generally,
if the $F$-ratio for any set of variables is less than 1, then dropping this
set of variables from the regression equation will increase $\bar{R}^2$. Since
the single-variable case is a special case of the many-variable case, we will
prove the latter result. Equation (4.21) shows the relationship between
$\bar{R}^2$ and $\hat{\sigma}^2$. So, instead of asking the question of whether
dropping the variables will increase $\bar{R}^2$, we can as well ask the
question of whether $\hat{\sigma}^2$ will decrease.

12 In the limiting case when the number of parameters estimated is equal to the
number of observations, we get $\hat{\sigma}^2 = 0/0$.

Let $\hat{\sigma}_1^2$ be the estimate of $\sigma^2$ when we drop $r$
regressors. Then

$$\hat{\sigma}_1^2 = \frac{\text{restricted residual sum of squares}}{n - (k - r) - 1}$$

Since the unrestricted residual sum of squares is $(n - k - 1)\hat{\sigma}^2$,
the $F$-test outlined earlier is given by

$$F = \frac{[(n - k + r - 1)\hat{\sigma}_1^2 - (n - k - 1)\hat{\sigma}^2]/r}{[(n - k - 1)\hat{\sigma}^2]/(n - k - 1)}$$

Solving for $\hat{\sigma}_1^2/\hat{\sigma}^2$ yields

$$\frac{\hat{\sigma}_1^2}{\hat{\sigma}^2} = \frac{a + F}{a + 1} \qquad \text{where } a = \frac{n - k - 1}{r}$$


Thus $\hat{\sigma}_1^2 \gtreqless \hat{\sigma}^2$ according as
$F \gtreqless 1$. What this says is that if the $F$-ratio associated with a set
of explanatory variables is $< 1$, we can increase $\bar{R}^2$ by dropping that
set of variables. Since for 1 degree of freedom in the numerator, $F = t^2$,
what this means is that if the absolute value of the $t$-ratio for any
explanatory variable is less than 1, dropping that variable will increase
$\bar{R}^2$. However, we have to be careful about $t$-ratios for individual
coefficients and $F$-ratios for sets of coefficients, and we will discuss the
relationships between $t$- and $F$-ratios. There are two cases that create
problems:


Case 1. The $t$-ratios are less than 1 but the $F$-ratio is greater than 1.

Case 2. The $t$-ratios are all greater than 1 but the $F$-ratio for a set of
variables is $< 1$.
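Before turning to these two cases, the relation between $\hat{\sigma}_1^2/\hat{\sigma}^2$ and $F$ derived above can be checked on simulated data. The following sketch is only an illustration: the data-generating process, the sample size, and the helpers `ols` and `rss` are assumptions of the example. It fits a regression with $k = 4$ regressors, drops the last $r = 2$, and compares the ratio of the two variance estimates with $(a + F)/(a + 1)$.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda q: abs(A[q][c]))
        A[c], A[piv] = A[piv], A[c]
        for q in range(p):
            if q != c:
                f = A[q][c] / A[c][c]
                A[q] = [u - f * v for u, v in zip(A[q], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    b = ols(X, y)
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(X, y))

random.seed(2)
n, k, r = 40, 4, 2
Z = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
# Only the first two regressors enter the simulated relation.
y = [1.0 + 0.8 * z[0] - 0.5 * z[1] + random.gauss(0, 1) for z in Z]

X_full = [[1.0] + z for z in Z]                 # constant + k regressors
X_restricted = [[1.0] + z[:k - r] for z in Z]   # last r regressors dropped

urss = rss(X_full, y)
rrss = rss(X_restricted, y)
sigma2_hat = urss / (n - k - 1)
sigma2_hat_1 = rrss / (n - (k - r) - 1)
F = ((rrss - urss) / r) / (urss / (n - k - 1))
a = (n - k - 1) / r

print(sigma2_hat_1 / sigma2_hat, (a + F) / (a + 1))  # the two should agree
print(F < 1, sigma2_hat_1 < sigma2_hat)              # the two should agree
```

Whatever value $F$ takes in a particular sample, the variance estimate falls when the set is dropped exactly when $F < 1$, which is the statement about $\bar{R}^2$ in the text.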

Case 1 occurs when the explanatory variables are highly intercorrelated.
(This is called multicollinearity, which we discuss in Chapter 7.) In this case
the fact that all the $t$-ratios are less than 1 does not mean that we can
increase $\bar{R}^2$ by dropping all the variables. Once we drop one variable
the other $t$-ratios will change.

In case 2, though by dropping any one variable we cannot increase $\bar{R}^2$,
it might be possible to get a higher $\bar{R}^2$ by dropping a set of
explanatory variables. Suppose that we have a regression equation in which all
the explanatory variables have $t$-ratios which are greater than 1. Obviously,
we cannot increase $\bar{R}^2$ by dropping any one of the variables. But how do
we know whether we can increase $\bar{R}^2$ by dropping some sets of variables
without searching over all the sets and subsets?

To answer this question we will state a simple rule that gives the relationship
between $t$- and $F$-ratios. Consider a set of $k$ variables that are
candidates for exclusion, and let $F(k, n)$ be the $F$-ratio associated with
these variables ($n$ is the sample size). Then the rule says: If
$F(k, n) \le c$, the absolute $t$-values of each of the $k$ discarded variables
must be less than $\sqrt{kc}$; that is, if $F(k, n) < 1$, the absolute
$t$-value of each of the $k$ variables is $< \sqrt{k}$. The converse, however,
is not true. 13 Thus if we do not have at least $k$ variables with absolute
$t$-values less than $\sqrt{k}$, $\bar{R}^2$ cannot be increased by discarding
$k$ independent variables at a time.


13 Potluri Rao, “On a Correspondence Between t and F Values in Multiple
Regression,” American Statistician, Vol. 30, No. 4, 1976, pp. 190-191. We just
present the results here; those interested in the derivation can refer to
Potluri Rao’s paper.

However, if there are $k$ or more independent variables with absolute
$t$-values less than $\sqrt{k}$, the $F$-ratio may or may not be less than 1,
and hence we may or may not be able to increase $\bar{R}^2$ by discarding these
variables. But if the $\bar{R}^2$ is increased, the variables to be discarded
must come from the set of independent variables with absolute $t$-values less
than $\sqrt{k}$.

As an illustration, consider the case $k = 7$. Since $\sqrt{7} = 2.6$, if the
$F$-ratio is less than 1, all we can say about the $t$-ratios is that the
$t$-ratios are less than 2.6. But this means that we can have all the
$t$-ratios significant and yet have the $\bar{R}^2$ rise by dropping all the
variables.

As yet another example, consider a regression equation with five independent
variables and $t$-ratios of 1.2, 1.5, 1.6, 2.3, and 2.7. Note that
$\sqrt{2} = 1.414$, $\sqrt{3} = 1.732$, and $\sqrt{5} = 2.236$. We consider
$k = 1, 2, 3, 4, 5$ and check whether there are $k$ $t$-ratios $< \sqrt{k}$.
We note that this is the case for only $k = 3$. Thus if we can increase
$\bar{R}^2$ at all, it is by dropping the three variables $x_1$, $x_2$, $x_3$.
Thus all we have to do is to run the regression with these three variables
excluded and check whether $\bar{R}^2$ has increased.
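The screening step in this example is mechanical and can be written out in a few lines. The function name `candidate_sets` is an assumption of this sketch; it only applies the necessary condition stated above (at least $k$ absolute $t$-values below $\sqrt{k}$) and does not itself decide whether $\bar{R}^2$ rises.

```python
import math

def candidate_sets(t_ratios):
    """For each k, list the |t|-values below sqrt(k) if there are at least k."""
    ts = sorted(abs(t) for t in t_ratios)
    out = {}
    for k in range(1, len(ts) + 1):
        small = [t for t in ts if t < math.sqrt(k)]
        if len(small) >= k:
            out[k] = small
    return out

result = candidate_sets([1.2, 1.5, 1.6, 2.3, 2.7])
print(result)  # {3: [1.2, 1.5, 1.6]}
```

Only $k = 3$ survives the screen, so a single additional regression, with the three variables excluded, settles the question.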

The point in all this discussion is that in multiple regression equations one
has to be careful in drawing conclusions from individual $t$-ratios. In
particular, this is so for analyzing the effect on $\bar{R}^2$ of deletion or
addition of sets of variables. Often, in applied work it is customary to run a
“stepwise” regression where explanatory variables are entered into an equation
sequentially (in order determined by the maximum partial $r^2$ at each stage),
and to stop at a point where $\bar{R}^2$ stops increasing. What the previous
discussion shows is that it might be possible to increase $\bar{R}^2$ by
introducing a set of variables together.

Thus there are problems with maximizing $\bar{R}^2$. But if one is going to do
it, the relationship between $t$- and $F$-ratios we have discussed will be of
some help. The rationale behind the maximization of $\bar{R}^2$ is discussed in
Chapter 12.

A Cautionary Note on the Omission of Nonsignificant Variables: Finally,
there is one other result that needs to be noted regarding the procedure of
deleting variables whose coefficients are not “significant.” Often researchers
are perturbed by some “wrong” signs for some of the coefficients. In an effort
to obtain hopefully “right” signs, statistically insignificant variables are
dropped. Surprisingly enough, there can be no change in the sign of any
coefficient that is more significant than the coefficient of the omitted
variable. Leamer 14 shows that the constrained least squares estimate of
$\beta_j$ must lie in the interval
$(\hat{\beta}_j - ts_j,\ \hat{\beta}_j + ts_j)$, where

$\hat{\beta}_j$ = unconstrained estimate of $\beta_j$
$s_j$ = standard error of $\hat{\beta}_j$
$t$ = absolute $t$-value for the deleted variable

We will not go through the proof here. It can be found in Leamer’s article. The
result enables us to predict the sign changes in the coefficients of the
retained variables when one of the variables is deleted.


14 E. E. Leamer, “A Result on the Sign of the Restricted Least Squares
Estimates,” Journal of Econometrics, Vol. 3, 1975, pp. 387-390.

As an illustration, consider the problem of estimation of the demand for
Ceylonese tea in the United States. This example is discussed in Rao and
Miller. 15 The following is the list of variables used.

Tea = demand for Ceylonese tea in the United States
$Y$ = disposable income
$P_C$ = price of Ceylonese tea
$P_B$ = price of Brazilian coffee, considered a substitute

$Y$, $P_C$, and $P_B$ are deflated by the price of food commodities in the
United States. All equations are estimated in log-linear form. The results are

$$\log \text{Tea} = \underset{(1.99)}{3.95} + \underset{(0.14)}{0.14}\log P_B + \underset{(0.24)}{0.75}\log Y + \underset{(0.41)}{0.05}\log P_C$$

(Figures in parentheses are standard errors.) The coefficient of $\log P_C$ has
the wrong sign, although it is not significantly different from zero. However,
dropping the variable $P_B$, we get

$$\log \text{Tea} = \underset{(2.02)}{3.22} + \underset{(0.25)}{0.67}\log Y + \underset{(0.42)}{0.04}\log P_C$$

Another alternative is to drop the variable $\log P_C$, arguing that the demand
for Ceylonese tea is price inelastic. This procedure gives us the result

$$\log \text{Tea} = \underset{(0.71)}{3.73} + \underset{(0.13)}{0.14}\log P_B + \underset{(0.14)}{0.73}\log Y$$

However, the correct solution to the problem of a wrong sign for $\log P_C$ is
neither to drop that variable nor to drop $\log P_B$ but to see whether some
other relevant variables have been omitted. In this case the inclusion of the
variable

$P_I$ = price of Indian tea, which is a close substitute for Ceylonese tea

produces more meaningful results. The results are now

$$\log \text{Tea} = \underset{(2.00)}{2.84} + \underset{(0.13)}{0.19}\log P_B + \underset{(0.37)}{0.26}\log Y - \underset{(0.98)}{1.48}\log P_C + \underset{(0.69)}{1.18}\log P_I$$

Note the coefficient of $\log P_C$ is now negative and the income elasticity has
dropped considerably (from 0.73 to 0.26), and is not significant.

Of course, in this case the variable $P_I$ should have been included in the
first place rather than as an afterthought. The deletion of $\log Y$ from the
last equation will not change the signs of any of the other coefficients by
Leamer’s rule. The resulting equation is

$$\log \text{Tea} = \underset{(1.39)}{1.85} + \underset{(0.13)}{0.20}\log P_B - \underset{(0.39)}{2.10}\log P_C + \underset{(0.42)}{1.56}\log P_I$$

Now $\log P_C$ and $\log P_I$ are both significant, and have the correct signs.
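Leamer's bounds can be checked against the reported numbers. The sketch below takes the rounded estimates and standard errors from the equation that includes the price of Indian tea, uses the absolute $t$-value of the deleted variable $\log Y$, and verifies that each coefficient of the equation without $\log Y$ lies inside its interval. The function and dictionary names are assumptions of this illustration, and because the published figures are rounded the check is only approximate.

```python
def leamer_interval(b, se, t_deleted):
    """Interval that must contain the estimate after the deletion."""
    return b - t_deleted * se, b + t_deleted * se

# (estimate, standard error) from the equation including log P_I
unrestricted = {
    "constant": (2.84, 2.00),
    "log P_B": (0.19, 0.13),
    "log P_C": (-1.48, 0.98),
    "log P_I": (1.18, 0.69),
}
t_deleted = abs(0.26 / 0.37)  # absolute t-value of log Y, the deleted variable

# estimates after log Y is deleted
restricted = {"constant": 1.85, "log P_B": 0.20,
              "log P_C": -2.10, "log P_I": 1.56}

checks = {}
for name, (b, se) in unrestricted.items():
    lo, hi = leamer_interval(b, se, t_deleted)
    checks[name] = lo <= restricted[name] <= hi
    print(f"{name}: [{lo:.2f}, {hi:.2f}] contains {restricted[name]}: {checks[name]}")
```

All of the price coefficients have absolute $t$-values larger than that of $\log Y$, so none of the intervals contains zero and no sign can change, in line with the statement in the text.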

15 Potluri Rao and Roger Miller, Applied Econometrics (Belmont, Calif.:
Wadsworth, 1971), pp. 38-40.


4.11 Tests for Stability


When we estimate a multiple regression equation and use it for predictions at 
future points of time we assume that the parameters are constant over the entire 
time period of estimation and prediction. To test this hypothesis of parameter 
constancy (or stability) some tests have been proposed. These tests can be 
described as: 

1. Analysis-of-variance tests. 

2. Predictive tests. 


The Analysis-of-Variance Test 

Suppose that we have two independent sets of data with sample sizes $n_1$ and
$n_2$, respectively. The regression equation is

$$y = \alpha_1 + \beta_{11}x_1 + \beta_{12}x_2 + \cdots + \beta_{1k}x_k + u \qquad \text{for the first set}$$

$$y = \alpha_2 + \beta_{21}x_1 + \beta_{22}x_2 + \cdots + \beta_{2k}x_k + u \qquad \text{for the second set}$$

For the $\beta$’s the first subscript denotes the data set and the second
subscript denotes the variable. A test for stability of the parameters between
the populations that generated the two data sets is a test of the hypothesis:

$$H_0: \beta_{11} = \beta_{21},\ \beta_{12} = \beta_{22},\ \ldots,\ \beta_{1k} = \beta_{2k},\ \alpha_1 = \alpha_2$$

If this hypothesis is true, we can estimate a single equation for the data set
obtained by pooling the two data sets.

The $F$-test we use is the $F$-test described in Section 4.8 based on URSS and
RRSS. To get the unrestricted residual sum of squares we estimate the
regression model for each of the data sets separately. Define

$\text{RSS}_1$ = residual sum of squares for the first data set
$\text{RSS}_2$ = residual sum of squares for the second data set

Then

$\text{RSS}_1/\sigma^2$ has a $\chi^2$-distribution with d.f. $(n_1 - k - 1)$
$\text{RSS}_2/\sigma^2$ has a $\chi^2$-distribution with d.f. $(n_2 - k - 1)$

Since the two data sets are independent, $(\text{RSS}_1 + \text{RSS}_2)/\sigma^2$
has a $\chi^2$-distribution with d.f. $(n_1 + n_2 - 2k - 2)$. We will denote
$(\text{RSS}_1 + \text{RSS}_2)$ by URSS. The restricted residual sum of squares
RRSS is obtained from the regression with the pooled data. (This imposes the
restriction that the parameters are the same.) Thus $\text{RRSS}/\sigma^2$ has
a $\chi^2$-distribution with d.f. $(n_1 + n_2) - k - 1$. The test statistic is

$$F = \frac{(\text{RRSS} - \text{URSS})/(k + 1)}{\text{URSS}/(n_1 + n_2 - 2k - 2)} \qquad (4.22)$$

which has an $F$-distribution with degrees of freedom $(k + 1)$ and
$(n_1 + n_2 - 2k - 2)$. This test is derived in the appendix to this chapter.


Example 1: Stability of the Demand for Food Function 

Consider the data in Table 4.3, where we fitted separate demand functions for 
1927-1941 and 1948-1962 and for the entire period. Suppose that we want to 
test the stability of the parameters in the demand function between the two 
periods. The required numbers are given in Table 4.4. 

For equation 1 we have

URSS = sum of RSSs from the two separate regressions
       for 1927-1941 and 1948-1962
     = 0.1151 + 0.0544 = 0.1695
       with d.f. = 12 + 12 = 24
RRSS = RSS from a regression for the pooled data
     = 0.2866 with d.f. = 27

This regression from the pooled data imposes the restriction that the
parameters are the same in the two periods. Hence

$$F = \frac{(0.2866 - 0.1695)/3}{0.1695/24} = 5.53$$

From the $F$-tables with d.f. 3 and 24 we see that the 5% point is about 3.01
and the 1% point is about 4.72. Thus, even at the 1% level of significance, we
reject the hypothesis of stability. Thus there is no case for pooling.

Note that if we look at the individual coefficients, $\hat{\beta}_1$ is almost
the same for the two regressions. Thus it appears that the price elasticity has
been constant but it is the income elasticity that has changed in the two
periods. In Chapter 8 we discuss procedures of testing the stability of
individual coefficients using the dummy variable method.

Consider now equation 2. We now have

URSS = 0.1151 + 0.0535 = 0.1686 with d.f. = 11 + 11 = 22
RRSS = 0.2412 with d.f. = 26

Hence

$$F = \frac{(0.2412 - 0.1686)/4}{0.1686/22} = 2.37$$

From the $F$-tables with d.f. 4 and 22 we see that the 5% point is about 2.82.
So, at the 5% significance level, we do not reject the hypothesis of stability.
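The two statistics can be reproduced from the residual sums of squares quoted above. The helper `chow_f` is a name chosen for this sketch; it implements formula (4.22), with $k = 2$ for equation 1 and $k = 3$ for equation 2 as implied by the degrees of freedom reported in the text, and 15 observations in each period.

```python
def chow_f(rrss, rss1, rss2, n1, n2, k):
    """Analysis-of-variance F statistic (4.22); k regressors plus a constant."""
    urss = rss1 + rss2
    df_num = k + 1
    df_den = n1 + n2 - 2 * k - 2
    f_stat = ((rrss - urss) / df_num) / (urss / df_den)
    return f_stat, df_num, df_den

# Equation 1: 15 observations per period, 12 d.f. in each, so k = 2.
f1, num1, den1 = chow_f(0.2866, 0.1151, 0.0544, 15, 15, 2)
# Equation 2: 11 d.f. in each period, so k = 3.
f2, num2, den2 = chow_f(0.2412, 0.1151, 0.0535, 15, 15, 3)

print(round(f1, 2), num1, den1)  # 5.53 3 24
print(round(f2, 2), num2, den2)  # 2.37 4 22
```

The critical values still have to be read from the $F$-tables, as in the text.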

One can ask how this result came about. If we look at the individual
coefficients for equation 2 for the two periods 1927-1941 and 1948-1962
separately, we notice that the $t$-ratios are very small, that is, the standard
errors are very high relative to the magnitudes of the coefficient estimates.
Thus the observed differences in the coefficient estimates between the two
periods would not be statistically significant. When we look at the regression
for the pooled data we notice that the coefficient of the interaction term is
significant, but the estimates for the two periods separately, as well as the
rejection of the hypothesis of stability for equation 1, cast doubt on the
desirability of including the interaction term.


Example 2: Stability of Production Functions 

Consider the data in Table 3.11 on output and labor and capital inputs in the
United States for 1929-1967. 16 The variables are:

$X$ = index of gross domestic product (constant dollars)
$L_1$ = labor input index (number of persons adjusted for hours of work and
        educational level)
$L_2$ = persons engaged
$K_1$ = capital input index (capital stock adjusted for rates of utilization)
$K_2$ = capital stock in constant dollars

We will estimate regression equations of the form

$$\log X = \alpha + \beta_1 \log L + \beta_2 \log K + u \qquad (4.23)$$

First, considering the two measures of labor and capital inputs, we obtain the
following results:

$$\log X = \underset{(0.237)}{-3.938} + \underset{(0.083)}{1.451}\log L_1 + \underset{(0.048)}{0.384}\log K_1 \qquad (4.24)$$

$$R^2 = 0.9946 \qquad \bar{R}^2 = 0.9943 \qquad \text{RSS} = 0.0434 \qquad \hat{\sigma}^2 = 0.001205$$

$$\log X = \underset{(0.294)}{-6.388} + \underset{(0.100)}{2.082}\log L_2 + \underset{(0.067)}{0.571}\log K_2$$

$$R^2 = 0.9831 \qquad \text{RSS} = 0.1363$$


Figures in parentheses are standard errors. Since the dependent variable is
the same and the number of independent variables is the same, the $R^2$’s are
comparable. A comparison of the $R^2$’s indicates that $L_1$ and $K_1$ are
better explanatory variables than $L_2$ and $K_2$. Hence we will conduct
further analysis with $L_1$ and $K_1$ only.


16 The data are from L. R. Christensen and D. W. Jorgenson, “U.S. Real Product
and Real Factor Input 1929-67,” Review of Income and Wealth, March 1970.

One other thing we notice is that all variables increase with time. Regressing
each of the variables on time we find:

$$\log X = \underset{(0.032)}{4.897} + \underset{(0.0014)}{0.0395}\,t \qquad R^2 = 0.9549$$

$$\log L_1 = \underset{(0.020)}{4.989} + \underset{(0.0009)}{\text{[illegible]}}\,t \qquad R^2 = 0.9238$$

$$\log K_1 = \underset{(0.031)}{4.171} + \underset{(0.0013)}{0.0324}\,t \qquad R^2 = 0.9408$$


We can eliminate these time trends from these variables and then estimate the
production function with the trend-adjusted data. But this is the same as
including $t$ as an extra explanatory variable (see the discussion in Section
4.4). 17 Thus we get the result

$$\log X = -3.015 + \underset{(0.091)}{1.341}\log L_1 + \underset{(0.060)}{0.292}\log K_1 + \underset{(0.0022)}{0.0052}\,t$$

$$R^2 = 0.9954 \qquad \bar{R}^2 = 0.9949 \qquad \text{RSS} = 0.0375 \qquad \hat{\sigma}^2 = 0.001072$$

Comparing this result with (4.24), we notice that
$\hat{\beta}_1 + \hat{\beta}_2$ has gone down from 1.451 + 0.384 = 1.835 to
1.341 + 0.292 = 1.633. $\hat{\beta}_1 + \hat{\beta}_2$ measures returns to
scale. Also, the $\bar{R}^2$ has increased, or equivalently, $\hat{\sigma}^2$
has decreased. This is to be expected because the $t$-value for the last
variable is greater than 1.

Although we do not need to test the hypothesis of constant returns to scale,
that is, $\beta_1 + \beta_2 = 1$, we will illustrate it. The estimated
variances and covariances (from the SAS regression package used) were

estimate of $V(\hat{\beta}_1)$ = 0.008353
estimate of $V(\hat{\beta}_2)$ = 0.003581
estimate of $\mathrm{cov}(\hat{\beta}_1, \hat{\beta}_2)$ = -0.001552

Thus

estimate of $V(\hat{\beta}_1 + \hat{\beta}_2 - 1)$ = 0.008353 + 0.003581 - 2(0.001552)
                                                  = 0.008830

Hence

$$\text{SE}(\hat{\beta}_1 + \hat{\beta}_2 - 1) = \sqrt{0.008830} = 0.094$$

The $t$-statistic to test $\beta_1 + \beta_2 - 1 = 0$ is

$$t = \frac{(\hat{\beta}_1 + \hat{\beta}_2 - 1) - 0}{\text{SE}(\hat{\beta}_1 + \hat{\beta}_2 - 1)} = \frac{0.633}{0.094} = 6.73$$

From the $t$-tables with 34 d.f. we find that the 5% significance point is 2.03
and the 1% significance point is 2.73. Thus we reject the hypothesis of
constant returns to scale even at the 1% level of significance.
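The arithmetic of this test is short enough to write out. The sketch below uses only the estimates, variances, and covariance quoted above; the variable names are chosen for this illustration.

```python
import math

# Estimated variances and covariance reported in the text
var_b1, var_b2, cov_b12 = 0.008353, 0.003581, -0.001552
b1, b2 = 1.341, 0.292  # coefficients of log L_1 and log K_1

# Var(b1 + b2 - 1) = Var(b1) + Var(b2) + 2 Cov(b1, b2)
var_sum = var_b1 + var_b2 + 2 * cov_b12
se = math.sqrt(var_sum)
t_stat = (b1 + b2 - 1) / se

print(round(var_sum, 6), round(se, 3), round(t_stat, 1))
```

Because the covariance is negative, the variance of the sum is smaller than the sum of the two variances, which makes the test sharper than the individual standard errors alone would suggest.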


17 This result is commonly known as the Frisch-Waugh theorem.

Suppose that we estimate the production function (4.23) for the two periods
1929-1948 and 1949-1967 separately. We get the following results:

1929-1948

$$\log X = \underset{(0.357)}{-4.058} + \underset{(0.209)}{1.617}\log L_1 + \underset{(0.230)}{0.220}\log K_1 \qquad (4.25)$$

$$R^2 = 0.9759 \qquad \text{RSS} = 0.03555 \qquad \text{d.f.} = 17$$

1949-1967

$$\log X = \underset{(0.531)}{-2.498} + \underset{(0.144)}{1.009}\log L_1 + \underset{(0.055)}{0.579}\log K_1 \qquad (4.26)$$

$$R^2 = 0.9958 \qquad \text{RSS} = 0.00336 \qquad \text{d.f.} = 16$$

Applying the test for stability (4.22), we get

URSS = 0.03555 + 0.00336 = 0.0389 with d.f. = 17 + 16 = 33
RRSS = 0.0434 [from (4.24) earlier] with d.f. = 36

Thus

$$F = \frac{(0.0434 - 0.0389)/3}{0.0389/33} = 1.27$$

From the $F$-tables with 3 and 33 d.f. we find that the 5% point is 2.9. Thus at
the 5% significance level we do not reject the hypothesis of stability.

Again, looking at the individual coefficient estimates for the two periods, this 
result is perplexing. We will consider some other tests for stability and see 
whether these tests confirm this result. 


Predictive Tests for Stability 

The analysis-of-variance test that we have discussed is also commonly referred
to as the Chow test, although it had been known much earlier. 18 Chow suggests
another test that can be used even when $n_2 < (k + 1)$. This is the predictive
test for stability. The idea behind it is this: We use the first $n_1$
observations to estimate the regression equation and use it to get predictions
for the next $n_2$ observations. Then we test the hypothesis that the
prediction errors have mean zero. If $n_2 = 1$, we just use the method
described in Section 4.7. If $n_2 > 1$, the $F$-test is given by

$$F = \frac{(\text{RSS} - \text{RSS}_1)/n_2}{\text{RSS}_1/(n_1 - k - 1)} \qquad (4.27)$$

which has an $F$-distribution with d.f. $n_2$ and $n_1 - k - 1$. Here

RSS = residual sum of squares from the regression based on $n_1 + n_2$
      observations; this has $(n_1 + n_2) - (k + 1)$ d.f.

$\text{RSS}_1$ = residual sum of squares from the regression based on
      $n_1$ observations; this has $n_1 - k - 1$ d.f.


18 G. C. Chow, “Tests of Equality Between Subsets of Coefficients in Two Linear
Regression Models,” Econometrica, 1960, pp. 591-605. The paper suggests two
tests: the analysis-of-variance test and the predictive test. The former test,
although referred to as the Chow test, was discussed earlier in C. R. Rao,
Advanced Statistical Methods in Biometric Research (New York: Wiley, 1952), and
S. Kullback and H. M. Rosenblatt, “On the Analysis of Multiple Regression in k
Categories,” Biometrika, 1957, pp. 67-83. Thus it is the second test, the
predictive test, that should be called the Chow test.


The proof is given in the appendix to this chapter. 19 In Chapter 8 we give a 
dummy variable method of applying the same test. We will first illustrate the 
use of this test and then discuss its advantages and limitations. 


Illustrative Example 

Consider the example of demand for food that we considered earlier. We have,
for equation 1, from Table 4.4:

For 1927-1941: $\text{RSS}_1$ = 0.1151
For 1948-1962: $\text{RSS}_2$ = 0.0544
Combined data: RSS = 0.2866

Considering the predictions for 1948-1962 using the estimated equation for
1927-1941, we have

$$F = \frac{(\text{RSS} - \text{RSS}_1)/n_2}{\text{RSS}_1/(n_1 - k - 1)} = \frac{(0.2866 - 0.1151)/15}{0.1151/12} = 1.19$$

From the $F$-tables with d.f. 15 and 12 we find that the 5% point is 2.62. Thus
at the 5% level of significance, we do not reject the hypothesis of stability.
The analysis-of-variance test led to the opposite conclusion.

For equation 2 we have (from Table 4.4):

$\text{RSS}_1$ = 0.1151   $\text{RSS}_2$ = 0.0535   RSS = 0.2412

The $F$-test is

$$F = \frac{(0.2412 - 0.1151)/15}{0.1151/11} = 0.80$$

Thus at the 5% significance level we do not reject the hypothesis of stability.
Thus the conclusion of the predictive test does not seem to be different from
that of the analysis-of-variance test, for this equation.

However, for the predictive test, we can also reverse the roles of samples 1
and 2. That is, we can also ask the question of how well the equation fitted for
the second period predicts for the first period. If the coefficients are stable,
we should do well. The $F$-test for this is now (interchanging 1 and 2)

$$F = \frac{(\text{RSS} - \text{RSS}_2)/n_1}{\text{RSS}_2/(n_2 - k - 1)}$$

which has an $F$-distribution with d.f. $n_1$ and $n_2 - k - 1$. For equation 2
we have

$$F = \frac{(0.2412 - 0.0535)/15}{0.0535/11} = 2.57$$

From the $F$-tables with d.f. 15 and 11 we see that the 5% point is 2.72 and the
1% point is 4.25. Thus, even at the 5% significance level, we cannot reject the
hypothesis of stability of the coefficients.


19 The proof given in the appendix follows Maddala, Econometrics, pp. 459-460.
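The three predictive-test statistics in this example follow from one formula with the roles of the samples exchanged. The helper `predictive_f` is a name chosen for this sketch of (4.27); the residual sums of squares are those quoted from Table 4.4, with 15 observations in each period.

```python
def predictive_f(rss_pooled, rss_est, n_pred, n_est, k):
    """Predictive F statistic (4.27).

    rss_est comes from the sample used for estimation (n_est observations);
    n_pred is the number of observations being predicted.
    """
    return ((rss_pooled - rss_est) / n_pred) / (rss_est / (n_est - k - 1))

# Equation 1 (k = 2): estimate on 1927-1941, predict 1948-1962
f_eq1 = predictive_f(0.2866, 0.1151, 15, 15, 2)
# Equation 2 (k = 3): estimate on 1927-1941, predict 1948-1962
f_eq2 = predictive_f(0.2412, 0.1151, 15, 15, 3)
# Equation 2 with the roles reversed: estimate on 1948-1962, predict 1927-1941
f_eq2_rev = predictive_f(0.2412, 0.0535, 15, 15, 3)

print(round(f_eq1, 2), round(f_eq2, 2), round(f_eq2_rev, 2))  # 1.19 0.8 2.57
```

The reversed test gives a much larger statistic because the second-period regression has the smaller residual variance, so the same pooled RSS is judged against a tighter yardstick.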

Chow suggested the predictive test for the case where $n_2$ is less than
$k + 1$. In this case the regression equation cannot be estimated with the
second sample and thus the analysis-of-variance test cannot be applied. In this
case only the predictive test can be used. He also suggested that the
predictive test can be used even when $n_2 > (k + 1)$ but that in this case the
analysis-of-variance test should be preferred because it is more powerful.

In our example we have $n_2 > (k + 1)$, but we have also used two predictive
tests. In practice it is desirable to use both predictive tests as illustrated
in our example.


Comments 

1. Wilson 20 argues that though the Chow test (the predictive test) has been
suggested only for the case $n_2 < (k + 1)$, that is, for the case when the
analysis-of-variance test cannot be used, the test has desirable power
properties when there are some unknown specification errors. Hence it should be
used even when $n_2 > (k + 1)$, that is, even in those cases where the
analysis-of-variance test can be computed. We have illustrated how the
predictive test can be used in two ways in this case.

2. Rea 21 argues that in the case $n_2 < (k + 1)$ the Chow test cannot be
considered a test for stability. All it tests is that the prediction error has
mean zero, that is, the predictions are unbiased. If the coefficients are
stable, the prediction error will have zero mean. But the converse need not be
true in the case $n_2 < (k + 1)$. The prediction error can have a zero mean
even if the coefficients are not stable, if the explanatory variables have
moved in an offsetting manner. Rea’s conclusion is that “the Chow test is
incapable of testing the hypothesis of equality against that of inequality. It
can never be argued from the Chow test itself that the two sets [of regression
coefficients] are equal, although at times it may be possible to conclude that
they are unequal.” This does not mean that the Chow test is not useful. Instead
of calling it a test for stability we would call it a test for unbiasedness in
prediction. Note that in the case where both $n_1$ and $n_2$ are greater than
$(k + 1)$, the two predictive tests that we have illustrated are tests for
stability.


20 A. L. Wilson, “When Is the Chow Test UMP?” The American Statistician, Vol.
32, No. 2, May 1978, pp. 66-68.

21 J. D. Rea, “Indeterminacy of the Chow Test When the Number of Observations Is
Insufficient,” Econometrica, Vol. 46, No. 1, January 1978, p. 229.

3. Another problem with the application of the tests for stability, which
applies to both the analysis-of-variance and predictive tests, is that the
tests are inaccurate if the error variances in the two samples are unequal. 22
The true size of the test (under the null hypothesis) may not equal the
prescribed $\alpha$ level. For this reason it would be desirable to test the
equality of the variances.

Consider, for instance, the error variances for equation 1 in Table 4.4. The
$F$-statistic to test equality of error variances is

$$F = \frac{\hat{\sigma}_1^2}{\hat{\sigma}_2^2} = \frac{0.9594}{0.4534} = 2.12$$

For the $F$-distribution with d.f. 12 and 12 the 5% point is 2.69. Thus we do
not reject the hypothesis of equality at the 5% significance level.

For equation 2 the corresponding test statistic is

$$F = \frac{1.0462}{0.4866} = 2.15$$

Again if we use a 5% significance level, we do not reject the hypothesis of
equality of the error variances.
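The variance ratios above can also be obtained from the residual sums of squares and degrees of freedom given earlier for the two periods, since each variance estimate is RSS divided by its degrees of freedom. The helper name `variance_ratio` is an assumption of this sketch.

```python
def variance_ratio(rss_a, df_a, rss_b, df_b):
    """Ratio of two residual variance estimates, each RSS / d.f."""
    return (rss_a / df_a) / (rss_b / df_b)

# Equation 1: RSS 0.1151 and 0.0544, each with 12 d.f.
f_var_eq1 = variance_ratio(0.1151, 12, 0.0544, 12)
# Equation 2: RSS 0.1151 and 0.0535, each with 11 d.f.
f_var_eq2 = variance_ratio(0.1151, 11, 0.0535, 11)

print(round(f_var_eq1, 2), round(f_var_eq2, 2))  # 2.12 2.15
```

With equal degrees of freedom in the two periods the statistic reduces to the ratio of the two residual sums of squares.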

Thus, in both cases we might be tempted to conclude that we can apply the
tests for stability. There is, however, one problem with such a conclusion. This
is that the F-test for equality of variances is a pretest, that is, it is a test
preliminary to the test for stability. There is the question of what significance level
we should use for such pretests. The general conclusion is that for pretests one
should use a higher significance level than 5%. In fact, 25 to 50% is a good rule.
If this is done, we would reject the hypothesis of equality of variances in the
case of both equations 1 and 2.
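The variance-ratio pretest above can be sketched in a few lines of Python. This is only an illustration: the function name `variance_ratio` is hypothetical, the variances are the ones quoted in the text, and the only critical value used is the 5% point of F(12, 12) that the text states for equation 1.

```python
# Variance-ratio pretest for equality of error variances.
# The four variances and the 5% critical value are taken from the text;
# the function name is an illustrative assumption.

def variance_ratio(var_a, var_b):
    """Return larger variance / smaller variance, so the statistic is >= 1."""
    return max(var_a, var_b) / min(var_a, var_b)

f_eq1 = variance_ratio(0.9594, 0.4534)   # equation 1
f_eq2 = variance_ratio(1.0462, 0.4866)   # equation 2

F_CRIT_5PCT_EQ1 = 2.69  # 5% point of F(12, 12), as quoted in the text

print(f"equation 1: F = {f_eq1:.2f}, reject at 5%: {f_eq1 > F_CRIT_5PCT_EQ1}")
print(f"equation 2: F = {f_eq2:.2f}")
```

A pretest at the 25 to 50% level would compare the same ratios with a smaller critical value, which would have to be looked up in F tables; it is not hard-coded here.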


*4.12 The LR, W, and LM Tests 


In the Appendix to Chapter 3 we stated large-sample test statistics to test the
hypothesis $\beta = 0$. These were

$$\text{LR} = n \log\left(\frac{1}{1 - r^2}\right) \qquad W = \frac{n r^2}{1 - r^2} \qquad \text{LM} = n r^2$$

Each has a $\chi^2$-distribution with 1 d.f. In the multiple regression model, to test
the hypothesis $\beta_i = 0$ we use these test statistics with the corresponding partial
$r^2$ substituted in the place of the simple $r^2$. The test statistics have a
$\chi^2$-distribution with d.f. 1.

22 This was pointed out in T. Toyoda, “Use of the Chow Test Under Heteroskedasticity,”
Econometrica, 1976, pp. 601-608. The approximations used by Toyoda were found to be
inaccurate, but the inaccuracy of the Chow test holds good. See P. Schmidt and R. Sickles,
“Some Further Evidence on the Use of the Chow Test Under Heteroskedasticity,”
Econometrica, Vol. 45, No. 5, July 1977, pp. 1293-1298.

To test hypotheses such as

$$\beta_1 = \beta_2 = \cdots = \beta_k = 0$$

we have to substitute the multiple $R^2$ in place of the simple $r^2$ or partial $r^2$ in
these formulae. The test statistics have a $\chi^2$-distribution with d.f. $k$.

To test any linear restrictions, we saw (in the Appendix to Chapter 3) that
the likelihood-ratio test statistic was

$$\text{LR} = n \log\left(\frac{\text{RRSS}}{\text{URSS}}\right)$$

where RRSS = restricted residual sum of squares
      URSS = unrestricted residual sum of squares

LR has a $\chi^2$-distribution with d.f. $r$, where $r$ is the number of restrictions. The test
statistics for the Wald test and the LM test are given by

$$W = \frac{\text{RRSS} - \text{URSS}}{\text{URSS}/n} \qquad \text{LM} = \frac{\text{RRSS} - \text{URSS}}{\text{RRSS}/n}$$

Both W and LM have a $\chi^2$-distribution with d.f. $r$. The inequality
$W \ge \text{LR} \ge \text{LM}$ again holds and the proof is the same as that given in the
Appendix to Chapter 3.
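As a brief sketch of why the ordering holds when the statistics are written in terms of residual sums of squares (the Chapter 3 proof is not reproduced here), let $x = (\text{RRSS} - \text{URSS})/\text{URSS} \ge 0$. The symbol $x$ is introduced only for this illustration.

```latex
\begin{aligned}
W &= n x, \qquad \text{LR} = n \log(1 + x), \qquad \text{LM} = \frac{n x}{1 + x}, \\
\frac{x}{1 + x} &\le \log(1 + x) \le x \quad (x \ge 0)
\quad\Longrightarrow\quad W \ge \text{LR} \ge \text{LM}.
\end{aligned}
```

Setting $r^2 = 1 - \text{URSS}/\text{RRSS} = x/(1 + x)$ gives $\text{LM} = n r^2$, $W = n r^2/(1 - r^2)$, and $\text{LR} = n \log[1/(1 - r^2)]$, so the residual-sum-of-squares forms and the $r^2$ forms are the same statistics.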


Illustrative Example 

Consider example 1 in Section 4.11 (stability of the demand for food function). 
For equation 1 we have 

URSS = 0.1695    RRSS = 0.2866    n = 30

and the number of restrictions r = 3. We get

W = 20.73
LR = 15.76
LM = 12.26

Looking at the $\chi^2$ tables for 3 d.f. the 0.01 significance point is 11.3. Thus all
the test statistics are significant at that level, rejecting the hypothesis of
coefficient stability. As we saw earlier, the F-test also rejected the hypothesis at the
1% significance level.

Turning to equation 2, we have 

URSS = 0.1686    RRSS = 0.2397    n = 30    r = 4

We now get

W = 12.65
LR = 10.56
LM = 8.90

From the $\chi^2$ tables with 4 d.f. the 5% significance point is 9.49. Thus both the
W and LR tests reject the hypothesis of coefficient stability at the 5%
significance level, whereas the LM test does not. There is thus a conflict among the
three test criteria. We saw earlier that the F-test did not reject the hypothesis
at the 5% significance level either.
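The three statistics in this example follow directly from RRSS, URSS, and n, so they are easy to recompute. The following is a minimal sketch; the function name `w_lr_lm` is an assumption made for illustration, and the critical values are the ones quoted in the text.

```python
import math

def w_lr_lm(rrss, urss, n):
    """Return (W, LR, LM) from restricted and unrestricted residual sums of squares."""
    diff = rrss - urss
    w = diff / (urss / n)            # Wald statistic
    lr = n * math.log(rrss / urss)   # likelihood-ratio statistic
    lm = diff / (rrss / n)           # Lagrange multiplier statistic
    return w, lr, lm

# Inputs from the illustrative example in the text.
eq1 = w_lr_lm(rrss=0.2866, urss=0.1695, n=30)   # r = 3 restrictions
eq2 = w_lr_lm(rrss=0.2397, urss=0.1686, n=30)   # r = 4 restrictions

CHI2_1PCT_3DF = 11.3   # 1% point, 3 d.f., as quoted in the text
CHI2_5PCT_4DF = 9.49   # 5% point, 4 d.f., as quoted in the text

for label, stats, crit in (("equation 1", eq1, CHI2_1PCT_3DF),
                           ("equation 2", eq2, CHI2_5PCT_4DF)):
    for name, value in zip(("W", "LR", "LM"), stats):
        print(f"{label}: {name} = {value:.2f}, exceeds {crit}: {value > crit}")
```

For equation 2 the comparison shows W and LR above the 5% point and LM below it, which is the conflict among the three criteria described above.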

The conflict between the W, LR, and LM tests has been attributed to the
fact that in small samples the actual significance levels may deviate
substantially from the nominal significance levels. That is, although we said we were
testing the hypothesis of coefficient stability at the 5% significance level, we
were in effect testing it at different levels for the different tests. Procedures
have been developed to correct this problem but a discussion of these
procedures is beyond the scope of this book. The suggested formulas are too
complicated to be discussed here. However, the elementary introduction of these
tests given here will be useful in understanding some other tests discussed in
Chapters 4, 5, and 6.


Summary 

This chapter is very long and hence summaries will be presented by sections. 

1. Sections 4.2 to 4.5: Model with Two 
Explanatory Variables 

We discuss the model with two explanatory variables in great detail because it
clarifies many aspects of multiple regression. Of special interest are the
expressions for the variances of the estimates of the regression coefficients given at
the beginning of Section 4.3. These expressions are used repeatedly later in the
book. Also, it is important to keep in mind the distinction between separate
confidence intervals for each individual parameter and joint confidence
intervals for sets of parameters (discussed in Section 4.3). Similarly, there can be
conflicts between tests for each coefficient separately (t-tests) and tests for a
set of coefficients (F-test). (This is discussed in greater detail in Section 4.10.)
Finally, in Section 4.5 it is shown that each coefficient in a multiple regression
involving two variables can be interpreted as the regression coefficient in a
simple regression involving two variables after removing the effect of all other
variables on these two variables. This interpretation is useful in many problems
and will be used in other parts of the book.

2. The illustrative examples given at the end of Section 4.3 show how
sometimes we can get “wrong” signs for some of the coefficients and how this can
change with the addition or deletion of variables. (We discuss this in more
detail in Section 4.10.) The example of gasoline demand represents one where
the results obtained were poor and further analysis with the data was left as an
exercise.

3. Section 4.6: Simple, Partial, and 
Multiple Correlations 

In multiple regression it is important to note that there is no necessary
relationship between the simple correlation between two variables y and x and the
partial correlation between these variables after allowing for the effect of other
variables (see Table 4.2 for an illustration). There is, however, some
relationship between $R^2$ and the simple $r^2$ and partial $r^2$'s. This is given in equation
(4.12). Also partial $r^2 = t^2/(t^2 + \text{d.f.})$ is a useful relationship. Some examples
are given to illustrate these relationships.
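The relationship between a t-ratio and the corresponding partial $r^2$ is simple enough to compute directly. The sketch below uses a hypothetical function name and hypothetical inputs (a t-ratio of 2.0 with 20 residual degrees of freedom); neither comes from the text.

```python
def partial_r2(t_ratio, df):
    """Partial r^2 implied by a coefficient's t-ratio and the residual d.f."""
    t_sq = t_ratio ** 2
    return t_sq / (t_sq + df)

# Hypothetical inputs, for illustration only.
example = partial_r2(t_ratio=2.0, df=20)
print(f"partial r^2 = {example:.4f}")
```

Because the formula depends only on $t^2$, a coefficient with a given absolute t-ratio has the same partial $r^2$ whatever its sign.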

4. Section 4.7: Prediction 

In the case of the simple regression model (Section 3.7) the variance of the
prediction increased with the distance of $x_0$ from $\bar{x}$. In the case of prediction from
the multiple regression model this is not necessarily the case. An example is
given to illustrate this point. Again, as in the case of the simple regression
model, we can consider prediction of $y_0$ or prediction of $E(y_0)$. The predicted
value will be the same in both the cases. However, the variance of the
prediction error will be different. In the case of prediction of $E(y_0)$ we have to subtract
$\sigma^2$ from the corresponding expression for the prediction of $y_0$. Note that we did
not discuss the prediction of $E(y_0)$ here as we did this in the simple regression
case (Section 3.7).

5. Section 4.8: Tests of Hypotheses 

Tests of single parameters and single linear functions of parameters will be
t-tests. Tests of several parameters and several linear functions of parameters
are F-tests. These are both illustrated with examples. Note again that there can
be conflicts between the two tests. For instance, the t-statistics for each
coefficient can be nonsignificant and yet the F-statistic for a set of coefficients can
be significant.

6. Section 4.9: Omitted Variables and 
Irrelevant Variables 

The omission of relevant variables produces biased estimates. Expressions are
given in equations (4.16)-(4.18) for the omitted variable bias. The variance of
the estimated coefficient will be smaller, although the estimated variance (or
standard error) need not be. These points are illustrated with examples. The
case with the inclusion of irrelevant variables is different. There is no bias.
However, the variance of the estimated coefficients increases. Thus we get
unbiased but inefficient estimators. These are all only statistical guidelines
regarding omission of relevant and inclusion of irrelevant variables.

7. Section 4.10: $\bar{R}^2$

The addition of explanatory variables always increases $R^2$. This does not mean
that the regression equation is improving. The appropriate thing to look at is
the estimate of the error variance. An equivalent measure is $\bar{R}^2$, the value of
$R^2$ adjusted for the loss in degrees of freedom due to the addition of more
explanatory variables. It is given by equation (4.20). A procedure usually
followed is to keep on adding variables until the $\bar{R}^2$ stops increasing. Apart from
the lack of any economic rationale, there are some pitfalls in this procedure.
The $\bar{R}^2$ might increase by the addition (or deletion) of two or more variables
even though it might not if one variable is added (or dropped) at a time. Some
rules are given for the prediction of sign changes in the estimates of the
coefficients of the retained variables when a variable is deleted. Although
maximization of $\bar{R}^2$ and mechanical deletion of nonsignificant variables have serious
pitfalls, these rules provide some useful predictions.
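To make the contrast between $R^2$ and $\bar{R}^2$ concrete, here is a small sketch. Equation (4.20) is not reproduced in this summary, so the code assumes the standard definition for a model with a constant term and $k$ explanatory variables, $1 - \bar{R}^2 = \frac{n-1}{n-k-1}(1 - R^2)$. The function name and the numbers are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k explanatory variables plus a constant.

    Assumes the standard definition 1 - Rbar^2 = (n - 1)/(n - k - 1) * (1 - R^2).
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical case: adding a third regressor raises R^2 slightly (0.900 -> 0.902)
# in a sample of 20 observations.
before = adjusted_r2(0.900, n=20, k=2)
after = adjusted_r2(0.902, n=20, k=3)
print(f"adjusted R^2 with 2 regressors: {before:.4f}")
print(f"adjusted R^2 with 3 regressors: {after:.4f}")
```

In this made-up case $R^2$ rises while $\bar{R}^2$ falls, which is the situation in which the added variable does not lower the estimated error variance.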

8. Section 4.11: Test for Stability 

In multiple regression analysis we are often concerned with the stability of the
estimated relationships across two samples of sizes $n_1$ and $n_2$. We discuss and
illustrate two tests: the analysis-of-variance test (AV test) and the predictive
test (Chow test). In practice it is desirable to use both tests. If either $n_1$ or $n_2$
is not greater than the number of regression parameters estimated, the AV test
cannot be used but the Chow test can be. However, in this case the Chow test
is not a test for stability. It is merely a test for unbiasedness of predictions.


Exercises 


More difficult exercises are marked with an *. 

1. Define the following terms. 

(a) Standard error of the regression. 

(b) $R^2$ and $\bar{R}^2$.

(c) Partial r 2 . 

(d) Tests for stability. 

(e) Degrees of freedom. 

(f) Linear functions of parameters. 

(g) Nested and nonnested hypotheses. 

(h) Analysis of variance.

2. In a multiple regression equation, show how you can obtain the partial
$r^2$'s given the t-ratios for the different coefficients.

3. In the multiple regression equation

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$

explain how you will test the joint hypothesis $\beta_1 = \beta_2$ and $\beta_3 = 1$.

4. The following regression equation is estimated as a production function. 

$$\log Q = 1.37 + \underset{(0.257)}{0.632} \log K + \underset{(0.219)}{0.452} \log L \qquad R^2 = 0.98$$

$\operatorname{cov}(b_K, b_L) = -0.044$. The sample size is 40. Test the following
hypotheses at the 5% level of significance.

(a) $b_K = b_L$.

(b) There are constant returns to scale.

5. Indicate whether each of the following statements is true (T), false (F), or 
uncertain (U), and give a brief explanation or proof. 

(a) Suppose that the coefficient of a variable in a regression equation
is significantly different from zero at the 20% level. If we drop this
variable from the regression, both $R^2$ and $\bar{R}^2$ will necessarily
decrease.

(b) Compared with the unconstrained regression, estimation of a least
squares regression under a constraint (say, $\beta_2 = \beta_3$) will result in a
higher $R^2$ if the constraint is true and a lower $R^2$ if it is false.

(c) In a least squares regression of y on x, observations for which x is 
far from its mean will have more effect on the estimated slope than 
observations for which x is close to its mean value. 

6. The following estimated equation was obtained by ordinary least squares 
regression using quarterly data for 1960 to 1979 inclusive (T = 80). 

$$\hat{y}_t = \underset{(3.4)}{2.20} + \underset{(0.005)}{0.104}\,x_{1t} + \underset{(2.2)}{3.48}\,x_{2t} + \underset{(0.15)}{0.34}\,x_{3t}$$

Standard errors are in parentheses, the explained sum of squares was
112.5, and the residual sum of squares was 19.5.

(a) Which of the slope coefficients are significantly different from zero
at the 5% significance level?

(b) Calculate the value of $R^2$ for this regression.

(c) Calculate the value of $\bar{R}^2$ (“adjusted $R^2$”).

7. Suppose that you are given two sets of samples with the following
information:

              Sample 1    Sample 2
   n             20          25
   x̄             20          23
   ȳ             25          28
   S_xx          80         100
   S_xy         120         150
   S_yy         200         250

(a) Estimate a linear regression equation for each sample separately
and for the pooled sample.

(b) State the assumptions under which estimation of the pooled
regression is valid.

(c) Explain how you will test the validity of these assumptions using
the data provided.

8. A researcher tried two specifications of a regression equation. 

$$y = \alpha + \beta x + u$$
$$y = \alpha' + \beta' x + \gamma' z + u'$$

Explain under what circumstances the following will be true. (A “hat”
over a parameter denotes its estimate.)

(a) $\hat\beta = \hat\beta'$.

(b) If $\hat{u}_t$ and $\hat{u}'_t$ are the estimated residuals from the two equations,
$\sum \hat{u}_t^2 \ge \sum \hat{u}_t'^2$.

(c) $\hat\beta$ is statistically significant (at the 5% level) but $\hat\beta'$ is not.

(d) $\hat\beta'$ is statistically significant (at the 5% level) but $\hat\beta$ is not.

9. The model 


$$y_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t$$

was estimated by ordinary least squares from 26 observations. The results
were

$$\hat{y}_t = 2 + \underset{(1.9)}{3.5}\,x_{1t} - \underset{(2.2)}{0.7}\,x_{2t} + \underset{(1.5)}{2.0}\,x_{3t}$$

t-ratios are in parentheses and $R^2 = 0.982$. The same model was estimated
with the restriction $\beta_1 = \beta_2$. Estimates were:

$$\hat{y}_t = 1.5 + \underset{(2.7)}{3}\,(x_{1t} + x_{2t}) - \underset{(2.4)}{0.6}\,x_{3t} \qquad R^2 = 0.876$$

(a) Test the significance of the restriction $\beta_1 = \beta_2$. State the
assumptions under which the test is valid.

(b) Suppose that $x_{2t}$ is dropped from the equation: would the $R^2$ rise or
fall?

(c) Would the $\bar{R}^2$ rise or fall if $x_{2t}$ is dropped?

10. Suppose that the least squares regression of Y on $x_1, x_2, \ldots, x_k$ yields
coefficient estimates $b_j$ ($j = 1, 2, \ldots, k$) none of which exceed their
respective standard errors. However, the F-ratio for the equation rejects, at
the 0.05 level, the hypothesis that $b_1 = b_2 = \cdots = b_k = 0$.

(a) Is this possible? 

(b) What do you think is the reason for this? 

(c) What further analysis would you perform? 

11. What would be your answer to the following queries regarding multiple 
regression analysis? 

(a) I am trying to find out why people go bankrupt. I have gathered 
data from a sample of people filing bankruptcy petitions. Will these 
data enable me to find answers to my question?

(b) I want to study the costs of auto accidents. I have collected data
from a sample of police auto accident reports. Are these data
adequate for my purpose?

(c) I am trying to estimate a consumption function and I suspect that 
the marginal propensity to consume varies inversely with the rate 
of interest. Do I run a multiple regression using income and interest 
rate as explanatory variables? 

(d) I am fitting a demand for food function for a sample of 1000 families. 
I obtain an R 2 of only 0.05 but the regression program indicates that 
the F-statistic for the equation is very significant and so are the t- 
statistics. How can this be? Is there a mistake in the program? 

(e) In the regression of Y on x and z, should I leave one of them out? 

(f) I know that y depends linearly on x but I am not sure whether or 
not it also depends on another variable z. A friend of mine suggests 
that I should regress y on x first, calculate the residuals, and then 
see whether they are correlated with z. Is she correct? 

12. A student obtains the following results in several different regression 
problems. In which cases could you be certain that an error has been 
committed? Explain. 

(a) $R^2_{1.23} = 0.89$, $R^2_{1.234} = 0.86$

(b) $\bar{R}^2_{1.23} = 0.86$, $\bar{R}^2_{1.234} = 0.82$

(c) $r^2_{12} = 0.23$, $r^2_{13} = 0.13$, $R^2_{1.23} = 0.70$

(d) Same as part (c) but $r_{23} = 0$

13. Given the following estimated regression equations 

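$$C_t = \text{const.} + 0.92\,Y_t$$

$$C_t = \text{const.} + 0.84\,C_{t-1}$$

$$C_{t-1} = \text{const.} + 0.78\,Y_t$$

$$Y_t = \text{const.} + 0.55\,C_{t-1}$$

calculate the regression estimates of $\beta_1$ and $\beta_2$ for

$$C_t = \beta_0 + \beta_1 Y_t + \beta_2 C_{t-1} + u_t$$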

*14. Instead of estimating the coefficients $\beta_1$ and $\beta_2$ from the model

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$$

it is decided to use ordinary least squares on the following regression
equation:

$$y = \alpha + \beta_1 x_1^* + \beta_2 x_2 + v$$

where $x_1^*$ is the residual from a regression of $x_1$ on $x_2$ and $v$ is the
disturbance term.

(a) Show that the resulting estimator of $\beta_2$ is identical to the regression
coefficient of y on $x_2$.

(b) Obtain an expression for the bias of this estimator.

(c) Prove that the estimators of $\beta_1$ obtained from each of the two
equations are identical.

*15. Explain how you will estimate a linear regression equation which is
piecewise linear with a joint (or knot) at $x = x_0$ if

(a) $x_0$ is known.

(b) $x_0$ is unknown.

*16. In the model 


$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t$$

the coefficients are known to be related to a more basic economic
parameter $\alpha$ according to the equations

$$\beta_1 + \beta_2 = \alpha$$

$$\beta_1 + \beta_3 = -\alpha$$

Assuming that the x's are nonrandom and that $u_t \sim \text{IN}(0, \sigma^2)$, find the best
unbiased linear estimator $\hat\alpha$ of $\alpha$ and the variance of $\hat\alpha$.

17. A study on unemployment in the British interwar period produced the 
following regression equation (data are given in Table 4.11): 

$$U = \underset{(2.0)}{5.19} + \underset{(4.46)}{18.3}\,(B/W) - \underset{(-8.3)}{90.0}\,(\log Q - \log Q^*)$$

$R^2 = 0.8$    SER = 1.9    where SER is the standard error of the regression
Sample period 1920-1938 (n = 19).

U                = unemployment rate
B/W              = ratio of unemployment benefits to average wage
Q                = actual output
Q*               = trend predicted output
log Q - log Q*   = captures unexpected changes in aggregate demand


The authors²³ conclude that the high benefit levels are partly responsible
for the high rates of unemployment. Critics of this study argued that when
the single observation for 1920 is dropped the results change
dramatically.²⁴ The equation now is

$$U = \underset{(3.0)}{7.9} + \underset{(2.4)}{12.9}\,(B/W) - \underset{(8.3)}{87.0}\,(\log Q - \log Q^*)$$

$R^2 = 0.82$    SER = 1.7


Sample period 1921-1938 (n = 18). 

Test whether the results are significantly different from each other. 


23 D. K. Benjamin and L. A. Kochin, “Searching for an Explanation of Unemployment in
Interwar Britain,” Journal of Political Economy, June 1979, pp. 441-478.

24 P. A. Ormerod and G. D. N. Worswick, “Unemployment in Interwar Britain,” Journal of
Political Economy, April 1982, pp. 400-409.

18. In a study on determinants of children born in the Philippines,²⁵ the
following results were obtained:


Variable    Coefficient    SE       t-Ratio    Coefficient    SE       t-Ratio
ED            -0.177       0.026     -6.81       -0.067       0.024     -2.79
LWH            0.476       0.327      1.46        0.091       0.289      0.31
AMAR             —           —         —         -0.296       0.016    -18.50
SURV          -0.006       0.003     -2.00       -0.003       0.003     -1.00
RURAL          0.361       0.193      1.87        0.281       0.171      1.64
Age            0.123       0.024      5.12        0.155       0.021      7.36
Constant       5.650       3.180      1.78        9.440       2.820      3.35
R²             0.096                              0.295

The variables are: 


ED     = years of schooling of the woman
LWH    = natural logarithm of present value of husband’s earnings at marriage
AMAR   = age of the woman at marriage
SURV   = survival probability at age 5 in the province
RURAL  = residence in rural area (dummy variable); this variable is supposed to
         capture search and schooling costs
Age    = age of the woman


The explained variable is number of children born. 

(a) Do the coefficients have the signs you would expect? 

(b) Using the t-ratio for AMAR and the $R^2$ for the two equations, can
you tell how many observations were used in the estimation?

(c) Looking at the t-ratio of AMAR, can you predict the signs of the
coefficients of the other variables if AMAR is deleted from the
equation?

(d) Given that the dependent variable is number of children born, do 
you think the assumptions of the least squares model are satisfied? 

19. In a study of investment plans and realizations in U.K. manufacturing 
industries since 1955, the following results were obtained: 


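$$A_t = \text{const.} - \underset{(6.24)}{54.60}\,C_{t-1} \qquad R^2 = 0.89 \quad DW = 2.50$$

$$I_t - A_t = \text{const.} - \underset{(4.44)}{19.96}\,(C_t - C_{t-1}) \qquad R^2 = 0.68 \quad DW = 2.31$$

$$I_t = \text{const.} + \underset{(0.10)}{0.88}\,A_t - \underset{(5.15)}{16.32}\,(C_t - C_{t-1}) \qquad R^2 = 0.90 \quad DW = 1.65$$

$$I_t = \text{const.} - \underset{(3.64)}{50.08}\,C_{t-1} - \underset{(3.32)}{14.60}\,(C_t - C_{t-1}) \qquad R^2 = 0.96 \quad DW = 2.61$$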


25 B. L. Boulier and M. R. Rosenzweig, “Schooling, Search and Spouse Selection: Testing
Economic Theories of Marriage and Household Behavior,” Journal of Political Economy, August
1984, p. 729.

$A_t$ = investment that firms anticipate they will complete in year t;
        these plans are held at the end of year t - 1
$I_t$ = actual investment in year t
$C_t$ = measure of average level of underutilization of capacity

Figures in parentheses are standard errors. 

(a) Interpret these results and assess whether or not knowledge of
firms’ anticipated investment is helpful in explaining actual
investment.

(b) How many observations have been used in the estimation?

(c) What is the partial correlation coefficient of $I_t$ with $A_t$ after allowing
for the effect of $C_t - C_{t-1}$?

20. The demand for Ceylonese tea in the United States is given by the
equation

$$\log Q = \beta_0 + \beta_1 \log P_C + \beta_2 \log P_I + \beta_3 \log P_B + \beta_4 \log Y + u$$

where Q     = imports of Ceylon tea in the United States
      $P_C$ = price of Ceylon tea
      $P_I$ = price of Indian tea
      $P_B$ = price of Brazilian coffee
      Y     = disposable income

The following results were obtained from T = 22 observations.

$$\log Q = \underset{(2.0)}{2.837} - \underset{(0.987)}{1.481} \log P_C + \underset{(0.690)}{1.181} \log P_I + \underset{(0.134)}{0.186} \log P_B + \underset{(0.370)}{0.257} \log Y \qquad \text{RSS} = 0.4277$$

$$\log Q + \log P_C = -\underset{(0.820)}{0.738} + \underset{(0.155)}{0.199} \log P_B + \underset{(0.165)}{0.261} \log Y \qquad \text{RSS} = 0.6788$$

Figures in parentheses are standard errors.

(a) Test the hypothesis $\beta_1 = -1$, $\beta_2 = 0$, and $\beta_3, \beta_4 \ne 0$ against
$\beta_i \ne 0$ for $i = 1, 2, 3, 4$.

(b) Discuss the economic implications of these results. 


Appendix to Chapter 4 


The Multiple Regression Model in Matrix Notation 

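Consider the multiple regression model with k explanatory variables:

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + u_i \qquad i = 1, 2, \ldots, n$$

This can be written as

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix} x_{11} & x_{21} & \cdots & x_{k1} \\ x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ x_{1n} & x_{2n} & \cdots & x_{kn} \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} +
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{bmatrix} \tag{4A.1}$$

or $y = X\beta + u$, where

y = an n × 1 vector of observations on the explained variable
X = an n × k matrix of observations on the explanatory variables
u = an n × 1 vector of errors
β = a k × 1 vector of parameters to be estimated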
We assume that: 


1. The errors are IID$(0, \sigma^2)$, that is, independently and identically distributed
with mean 0 and variance $\sigma^2$.

2. The x’s are nonstochastic and hence independent of the u’s.

3. The x’s are linearly independent. Hence rank$(X'X)$ = rank $X$ = k. This
implies that $(X'X)^{-1}$ exists.

Under these assumptions the best (minimum variance) unbiased linear
estimator (BLUE) of $\beta$ is obtained by minimizing the error sum of squares:

$$Q = u'u = (y - X\beta)'(y - X\beta)$$


This is known as the Gauss-Markoff theorem. 

We shall derive the formula for this estimator and show that it is a linear 
estimator, that it is unbiased, and that it has minimum variance among the class 
of linear unbiased estimators. That will complete the proof of the Gauss- 
Markoff theorem. 


Derivation 

We have $Q = y'y - 2\beta'X'y + \beta'X'X\beta$. Using the formulas for vector
differentiation derived in the Appendix to Chapter 2, we get

$$\frac{\partial Q}{\partial \beta} = 0 \quad\text{gives}\quad -2X'y + 2X'X\beta = 0 \quad\text{or}\quad \hat\beta = (X'X)^{-1}X'y \tag{4A.2}$$

Since $(X'X)^{-1}X'$ is a matrix of constants, the elements of $\hat\beta$ are linear functions
of the y’s. Hence $\hat\beta$ is a linear estimator. Also, substituting (4A.1) into (4A.2),
we get

$$\hat\beta = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u \tag{4A.3}$$

Since $E(u) = 0$, we have $E(\hat\beta) = \beta$. Thus $\hat\beta$ is an unbiased estimator. Also,

$$V(\hat\beta) = E(\hat\beta - \beta)(\hat\beta - \beta)' = (X'X)^{-1}X'E(uu')X(X'X)^{-1} = (X'X)^{-1}\sigma^2 \quad\text{since } E(uu') = I\sigma^2$$

Thus $\hat\beta$ is unbiased and has covariance matrix $(X'X)^{-1}\sigma^2$.

Now how do we show that this is minimum variance? Any other linear
estimator must be of the form $\beta^* = \hat\beta + Cy$. Then
$\beta^* = \beta + CX\beta + [(X'X)^{-1}X' + C]u$. Hence $E(\beta^*) = \beta + CX\beta$. But if $\beta^*$ is an
unbiased estimator for all values of $\beta$, we should have $CX = 0$. Then

$$V(\beta^*) = E(\beta^* - \beta)(\beta^* - \beta)' = [(X'X)^{-1}X' + C]\,E(uu')\,[(X'X)^{-1}X' + C]'$$

Since $E(uu') = I\sigma^2$ and $CX = 0$, this gives $V(\beta^*) = (X'X)^{-1}\sigma^2 + (CC')\sigma^2$.
Hence $V(\beta^*) \ge V(\hat\beta)$, in the sense that the difference $CC'\sigma^2$ is positive
semidefinite. Thus $\hat\beta$ is BLUE. Note that to prove $\hat\beta$ is BLUE, we
did not assume that the errors $u_i$ are normal. But to derive the sampling
distribution of $\hat\beta$ we have to assume normality.
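As an illustration of the estimator in (4A.2), the following sketch forms $X'X$ and $X'y$ and solves the normal equations in plain Python. The helper names (`solve`, `ols`) and the tiny data set are assumptions made for the example; a numerical library would normally be used instead.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: the x's are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(X, y):
    """Return beta_hat solving the normal equations (X'X) beta = X'y."""
    cols = list(zip(*X))  # columns of X
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)

# Hypothetical data lying exactly on y = 1 + 2x; the first column of X is the constant.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat = ols(X, y)
print(beta_hat)
```

Solving the normal equations directly mirrors the derivation; it avoids forming $(X'X)^{-1}$ explicitly, which is only needed when the covariance matrix $(X'X)^{-1}\sigma^2$ is required.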


Tests of Significance 

We shall now add the assumption that $u \sim N_n(0, I\sigma^2)$. We have from (4A.3),

$$X(\hat\beta - \beta) = X(X'X)^{-1}X'u = Mu$$

and the estimated residual $\hat{u}$ is given by

$$\hat{u} = y - X\hat\beta = X\beta + u - X\hat\beta = [I - X(X'X)^{-1}X']u = Nu$$

where $M = X(X'X)^{-1}X'$ and $N = I - M$. We shall use the properties of
idempotent matrices and the $\chi^2$-distribution stated in the Appendix to Chapter 2.

1. It can easily be verified that $M^2 = M$ and $N^2 = N$. Thus M and N are
idempotent matrices. Also, $MN = 0$.

2. Since M is idempotent, Rank(M) = Tr(M). Using the result Tr(AB) =
Tr(BA), we get

$$\operatorname{Tr}(M) = \operatorname{Tr}\,X(X'X)^{-1}X' = \operatorname{Tr}\,(X'X)^{-1}(X'X) = \operatorname{Tr}(I_k) = k$$

Thus Rank(M) = k. Similarly, Rank(N) = n - k.

3. Hence $(1/\sigma^2)\,u'Mu$ and $(1/\sigma^2)\,u'Nu$ have independent $\chi^2$-distributions with
d.f. k and (n - k), respectively.

4. Now the residual sum of squares

$$\hat{u}'\hat{u} = (y - X\hat\beta)'(y - X\hat\beta) = u'N^2u = u'Nu$$

is independent of the regression sum of squares

$$(\hat\beta - \beta)'X'X(\hat\beta - \beta) = u'M'Mu = u'M^2u = u'Mu$$

Hence

$$\frac{(\text{regression S.S.})/k}{(\text{residual S.S.})/(n - k)}$$

has an F-distribution with degrees of freedom k and (n - k).

This result can be used to construct confidence regions for $\beta$ and also to
apply any tests of significance. To test the hypothesis $\beta = \beta_0$, we substitute
the value $\beta_0$ for $\beta$ in the test statistic above and use it as an F-variate. Whether
or not the hypothesis is true, the denominator depends on $\hat\beta$ only and thus
always has a $\chi^2$-distribution. The numerator has a $\chi^2$-distribution only when the
null hypothesis is true. When it is false, it has a noncentral $\chi^2$-distribution, and
this is used to find the power of the test.

Since regression sum of squares = $S_{yy}R^2$ and the residual sum of squares
= $S_{yy}(1 - R^2)$, we can also write the F-test as

$$F = \frac{R^2/k}{(1 - R^2)/(n - k)} = \frac{R^2(n - k)}{(1 - R^2)k}$$

Note that in equation (4.14) we have (n - k - 1) because there is also a
constant term in addition to the k $\beta$’s. We can then consider $\beta$ to be a (k + 1)
vector, and the matrix X to be n × (k + 1), with the first column of X consisting
of all elements = 1.


Tests for Stability 

We shall derive the analysis of variance and predictive tests for stability
discussed in Section 4.11. Let us write

$$y_1 = X_1\beta_1 + u_1 \quad\text{for the first } n_1 \text{ observations}$$
$$y_2 = X_2\beta_2 + u_2 \quad\text{for the second } n_2 \text{ observations}$$

Write

$$y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \qquad X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}$$

and let $u$ stack $u_1$ and $u_2$ in the same way. We assume the errors to be
IN$(0, \sigma^2)$ in both the equations. If $\beta_1 = \beta_2$, we estimate the pooled
regression equation

$$y = X\beta + u \quad\text{for the } n = (n_1 + n_2) \text{ observations}$$

Let RSS₁ and RSS₂ be the residual sums of squares from the two separate
regressions and RRSS be the residual sum of squares from the pooled regression. (It
is called “restricted” because of the restriction $\beta_1 = \beta_2$.) We shall denote
(RSS₁ + RSS₂) by URSS (unrestricted residual sum of squares). We have to show
that

$$\frac{(\text{RRSS} - \text{URSS})/k}{\text{URSS}/(n - 2k)}$$

has an F-distribution with d.f. k, n - 2k.

Define

$$N_1 = I_1 - X_1(X_1'X_1)^{-1}X_1'$$
$$N_2 = I_2 - X_2(X_2'X_2)^{-1}X_2'$$

where $I_1$ and $I_2$ are identity matrices of orders $n_1$ and $n_2$, respectively. Then
RSS₁ = $u_1'N_1u_1$ and RSS₂ = $u_2'N_2u_2$. If we define

$$N_1^* = \begin{bmatrix} N_1 & 0 \\ 0 & 0 \end{bmatrix} \quad\text{and}\quad N_2^* = \begin{bmatrix} 0 & 0 \\ 0 & N_2 \end{bmatrix}$$

as two n × n matrices, we can write RSS₁ = $u'N_1^*u$ and RSS₂ = $u'N_2^*u$. Note
that $N_1^*N_2^* = 0$. Also, RRSS = $u'Nu$, where $N = I - X(X'X)^{-1}X'$.
We can write

$$N = \begin{bmatrix} I_1 & 0 \\ 0 & I_2 \end{bmatrix} - \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}(X'X)^{-1}\begin{bmatrix} X_1' & X_2' \end{bmatrix} = \begin{bmatrix} N_{11} & N_{12} \\ N_{21} & N_{22} \end{bmatrix}$$

where

$$N_{11} = I_1 - X_1(X'X)^{-1}X_1'$$
$$N_{12} = -X_1(X'X)^{-1}X_2'$$
$$N_{21} = -X_2(X'X)^{-1}X_1'$$
$$N_{22} = I_2 - X_2(X'X)^{-1}X_2'$$

Define $N^* = N_1^* + N_2^*$ so that we have URSS = $u'N^*u$ and RRSS = $u'Nu$. We
shall show that:

1. $(N - N^*)$ and $N^*$ are both idempotent.

2. $(N - N^*) \cdot N^* = 0$.

3. $\operatorname{Tr}(N^*) = n - 2k$ and $\operatorname{Tr}(N - N^*) = k$.

4. Hence (RRSS - URSS)/$\sigma^2$ and URSS/$\sigma^2$ have independent
$\chi^2$-distributions with d.f. k and n - 2k, respectively. From this the required F-ratio
follows.

Proof: Since $N_1^*$ and $N_2^*$ are both idempotent and $N_1^*N_2^* = 0$, $N^*$ is easily
seen to be idempotent. If we prove (2), it is easy to show that $N - N^*$ is
idempotent. Hence we shall prove result (2). We have

$$(N - N_1^*)N_1^* = \begin{bmatrix} (N_{11} - N_1)N_1 & 0 \\ N_{21}N_1 & 0 \end{bmatrix}$$

Since $X_1'N_1 = 0$, we have $N_{21}N_1 = 0$ and $N_{11}N_1 = N_1$. Since $N_1$ is idempotent,
it follows that $(N_{11} - N_1)N_1 = 0$. Thus $(N - N_1^*)N_1^* = 0$ or $NN_1^* = N_1^*$. Similarly,
$(N - N_2^*)N_2^* = 0$ or $NN_2^* = N_2^*$. Hence it follows that
$(N - N_1^* - N_2^*)(N_1^* + N_2^*) = 0$ or $(N - N^*)N^* = 0$.

$$\operatorname{Tr}(N) = n - k$$

$$\operatorname{Tr}(N^*) = \operatorname{Tr}(N_1^*) + \operatorname{Tr}(N_2^*) = (n_1 - k) + (n_2 - k) = n - 2k$$

Hence $\operatorname{Tr}(N - N^*) = \operatorname{Tr}(N) - \operatorname{Tr}(N^*) = k$. The rest follows from the
relationship between idempotent matrices and the $\chi^2$-distribution.

Predictive Test for Stability

If $n_2 < k$, the regression equation cannot be estimated with $n_2$ observations. In
this case the predictive test for stability is to use

$$\frac{(\text{RRSS} - \text{RSS}_1)/n_2}{\text{RSS}_1/(n_1 - k)}$$

which has an F-distribution with d.f. $n_2$ and $n_1 - k$. We have already derived
the expressions to prove this. We have shown that $(N - N_1^*)N_1^* = 0$. Also, $N_1^*$
is idempotent and of rank $(n_1 - k)$. Hence
$(N - N_1^*)(N - N_1^*) = N^2 - N_1^*N - NN_1^* + N_1^{*2} = N - N_1^* - N_1^* + N_1^* = N - N_1^*$.
Thus $N - N_1^*$ is idempotent.
$\operatorname{Rank}(N - N_1^*) = \operatorname{Tr}(N - N_1^*) = (n - k) - (n_1 - k) = n_2$.

$$\text{RRSS} - \text{RSS}_1 = u'(N - N_1^*)u$$
$$\text{RSS}_1 = u'N_1^*u$$

Hence the required result follows.
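The two F-ratios derived in this appendix can be computed from residual sums of squares alone. The sketch below uses hypothetical function names; `k` is the number of parameters in each regression, counted as in this appendix (including the constant if there is one). The first call uses URSS, RRSS, and n from the illustrative example in Section 4.12 for equation 1, with k = 3 taken from the number of restrictions; the inputs to the second call are made up.

```python
def stability_f(rrss, urss, n, k):
    """Analysis-of-variance test: F with d.f. (k, n - 2k)."""
    return ((rrss - urss) / k) / (urss / (n - 2 * k))

def predictive_f(rrss, rss1, n1, n2, k):
    """Predictive test: F with d.f. (n2, n1 - k)."""
    return ((rrss - rss1) / n2) / (rss1 / (n1 - k))

# Equation 1 of the demand-for-food example: URSS = 0.1695, RRSS = 0.2866, n = 30.
f_av = stability_f(rrss=0.2866, urss=0.1695, n=30, k=3)
print(f"AV test: F = {f_av:.2f} with d.f. (3, 24)")

# Hypothetical inputs for the predictive test (not from the text).
f_pred = predictive_f(rrss=1.5, rss1=1.0, n1=20, n2=5, k=3)
print(f"predictive test: F = {f_pred:.2f} with d.f. (5, 17)")
```

Each statistic would then be compared with the appropriate point of the F-distribution, which is left to tables here.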

Omitted Variables and Irrelevant Variables (Section 4.9) 

Suppose that the true model is

$$y = X\beta + u \qquad X \text{ is an } n \times k \text{ matrix}$$

Instead, we estimate

$$y = Z\delta + v \qquad Z \text{ is an } n \times r \text{ matrix}$$

r can be less than, equal to, or greater than k. The variables in Z may include
some variables in X. We then have

$$\hat\delta = (Z'Z)^{-1}Z'y = (Z'Z)^{-1}Z'(X\beta + u) = P\beta + (Z'Z)^{-1}Z'u$$

Since $E(u) = 0$, we have $E(\hat\delta) = P\beta$. $P = (Z'Z)^{-1}Z'X$ is the matrix of regression
coefficients of the variables X in the true model on the variables Z in the
misspecified model. As an example, suppose that the true equation is

$$y = \beta_1 x_1 + \beta_2 x_2 + u$$

Instead, we estimate

$$y = \delta_1 x_1 + \delta_2 x_3 + v$$

Then P is obtained by regressing each of $x_1$ and $x_2$ on $x_1$ and $x_3$. The regression
of $x_1$ on $x_1$ and $x_3$ gives coefficients 1 and 0. The regression of $x_2$ on $x_1$ and $x_3$
gives coefficients (say) $b_{21}$ and $b_{23}$. These regressions are known as the auxiliary
regressions. Hence we get

$$E\begin{bmatrix} \hat\delta_1 \\ \hat\delta_2 \end{bmatrix} = \begin{bmatrix} 1 & b_{21} \\ 0 & b_{23} \end{bmatrix}\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}$$

or $E(\hat\delta_1) = \beta_1 + b_{21}\beta_2$ and $E(\hat\delta_2) = b_{23}\beta_2$.

Suppose that Z includes irrelevant variables, so that the true equation is

$$y = X_1\beta_1 + u$$

and the misspecified equation is

$$y = X_1\beta_1 + X_2\beta_2 + v$$

In this case, the matrix P is the regression coefficients of $X_1$ on $X_1$ and $X_2$. These
are $\begin{bmatrix} I \\ 0 \end{bmatrix}$. Hence
$E\begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} I \\ 0 \end{bmatrix}\beta_1$
or $E(\hat\beta_1) = \beta_1$ and $E(\hat\beta_2) = 0$. Thus even if
some “irrelevant” variables are included, we get unbiased estimates for the
coefficients of the “relevant” variables.
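The omitted-variable result $E(\hat\delta) = P\beta$ can be checked numerically. In the sketch below the data, the coefficient values, and the helper `ols2` are all hypothetical. The errors are set to zero so that $\hat\delta$ equals its expectation exactly and the identity can be verified without simulation.

```python
def ols2(z1, z2, y):
    """OLS of y on two regressors (no constant), using the 2 x 2 normal equations."""
    s11 = sum(a * a for a in z1)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s22 = sum(b * b for b in z2)
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical regressors and true coefficients.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
x3 = [1.0, 0.0, 1.0, 0.0, 2.0]
beta1, beta2 = 2.0, 3.0

# True model with u = 0, so the estimates equal their expectations.
y = [beta1 * a + beta2 * b for a, b in zip(x1, x2)]

# Misspecified regression of y on x1 and x3 (x2 omitted, x3 wrongly included).
d1, d2 = ols2(x1, x3, y)

# Auxiliary regression of the omitted x2 on x1 and x3.
b21, b23 = ols2(x1, x3, x2)

print(d1, beta1 + b21 * beta2)   # coefficient on x1 vs. beta1 + b21 * beta2
print(d2, b23 * beta2)           # coefficient on x3 vs. b23 * beta2
```

With nonzero errors the two sides would agree only on average, which is what the expectation in the text expresses.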



Prior Adjustment (Section 4.4) 

Consider the multiple regression model

$$y = X_1\beta_1 + X_2\beta_2 + u$$

Let $\hat\beta_1$ be the estimator of $\beta_1$ from this equation. Suppose that instead of this,
we consider adjusting both $y$ and $X_1$ by removing the effect of $X_2$ on these
variables. Let the residuals from a regression of $y$ on $X_2$ be denoted by $y^*$ and
the residuals from a regression of $X_1$ on $X_2$ be denoted by $X_1^*$. Now regress the
adjusted $y^*$ on the adjusted $X_1^*$. Let this regression coefficient be $b$. We shall
show that

$$b = \hat\beta_1$$

That is, if we want to remove the effect of $X_2$ on $y$ and $X_1$ before running a
regression on the adjusted variables, we can get the same result by including
$X_2$ as an additional explanatory variable in the regression of $y$ on $X_1$. Usually,
$X_2$ is a trend variable or a set of seasonal variables.

Proof: Let $N = I - X_2(X_2'X_2)^{-1}X_2'$. Then as we showed earlier, the residual
$y^* = Ny$ and the residual $X_1^* = NX_1$. Hence $b = (X_1^{*\prime}X_1^*)^{-1}(X_1^{*\prime}y^*) = (X_1'NX_1)^{-1}X_1'Ny$.
We have to show that we get the same expression for $\hat\beta_1$. We
have $(X'X)\hat\beta = X'y$, which can be written as

$$X_1'X_1\hat\beta_1 + X_1'X_2\hat\beta_2 = X_1'y$$

$$X_2'X_1\hat\beta_1 + X_2'X_2\hat\beta_2 = X_2'y$$

The second equation gives

$$\hat\beta_2 = (X_2'X_2)^{-1}[X_2'y - X_2'X_1\hat\beta_1]$$

Substituting this in the first we get

$$X_1'X_1\hat\beta_1 + X_1'X_2(X_2'X_2)^{-1}[X_2'y - X_2'X_1\hat\beta_1] = X_1'y$$

or $(X_1'NX_1)\hat\beta_1 = X_1'Ny$. Thus $\hat\beta_1 = b$.
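The equality $b = \hat\beta_1$ can be verified on simulated data. The sketch below is illustrative only: the trend regressor, the sample size, and the coefficient values are assumptions chosen for the demonstration, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
trend = np.arange(n, dtype=float)

# X2: variables whose effect is to be removed (a constant and a linear trend).
X2 = np.column_stack([np.ones(n), trend])
# X1: variables of interest, deliberately correlated with the trend.
X1 = rng.normal(size=(n, 2)) + 0.05 * trend[:, None]
y = X1 @ np.array([1.5, -0.5]) + X2 @ np.array([2.0, 0.1]) + rng.normal(size=n)

# (a) Full regression of y on X1 and X2; keep the coefficients on X1.
X = np.column_stack([X1, X2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
beta1_hat = beta_hat[:2]

# (b) Prior adjustment: N = I - X2 (X2'X2)^{-1} X2' produces residuals from X2.
N = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
y_star = N @ y
X1_star = N @ X1
b = np.linalg.lstsq(X1_star, y_star, rcond=None)[0]

print("beta1_hat from full regression:", beta1_hat)
print("b from adjusted variables:     ", b)
```

The two printed vectors agree up to floating-point error, whatever seed or coefficients are used.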
Data Sets


Table 4.7 Sale Prices of Rural Land^a

 N    Price     WL     DA    D75       A   MO
 1    5,556   1.00   12.1    4.9    36.0   33
 2    5,236   1.00   12.1    4.9    38.2   30
 3    5,952   1.00   12.0    4.9    21.0   15
 4    7,000   0.00   16.0    1.2    40.0   44
 5    3,750   0.00   15.5    3.2    40.0   43
 6    7,000   0.00   13.7    3.2    20.0   25
 7    5,952   0.00   14.5    2.5    21.0   24
 8    2,009   0.00   16.1    0.1   656.0   19
 9    2,583   1.00   15.2    3.0    60.0   18
10    2,449   0.00   15.5    1.0   156.0   18
11    2,500   0.50   15.2    2.0    40.0    3
12    3,000   0.00   15.5    3.2    13.0    3
13    3,704   0.00   13.5    2.5    27.0    3
14    3,500   0.00   15.5    1.0    10.0    3
15    3,500   0.00   17.5    5.4    20.0   38
16    4,537   1.00   18.0    5.9    38.0   24
17    3,700   0.00   17.2    5.1     5.0    3
18    2,020   1.00   34.2   22.0     5.0   27
19    5,000   0.00   11.1    5.1     3.5   13
20    4,764   0.00   14.2    2.0   237.6   40
21      871   1.00   14.2    2.0   237.6    7
22    3,500   1.00   11.1    3.1    20.0   41
23   15,200   1.00   14.7    2.4     5.0   36
24    4,767   0.00   12.1    4.1    30.0   22
25   16,316   1.00   14.8    2.5     3.8   21
26    9,873   1.00   14.8    2.5     7.9   17
27    5,175   0.25   14.2    2.0    40.0   13
28    3,977   0.00   11.4    2.9     8.8   10
29    5,500   0.20   18.5    5.9    10.0   38
30    7,500   0.00   16.5    3.9     8.0   42
31    4,545   1.00   16.8    4.5    97.0   36
32    3,765   0.72   18.7    6.4   178.0   29
33    5,000   1.00   18.4    6.1    10.3   25
34    3,300   0.00   16.2    4.0   525.7   25
35    5,500   0.00   18.0    5.4     6.0   21
36    5,172   0.00   15.0    2.4    29.0   20
37    3,571   0.00   15.1    2.5    21.0   20
38    4,000   0.00   18.2    6.0    10.0   15
39    4,000   0.00   18.4    6.1    15.0   18
40    2,625   0.00   15.5    2.9    80.0   10
41    2,257   0.00   42.8   30.5   171.0   14
42   15,504   0.00    4.0    4.5    38.7   39
43    5,600   0.00    3.8    4.0    30.0   25
44    8,000   0.00    3.5    4.2    30.0   27
45    7,700   0.00    4.0    3.8    15.0   46
46    6,187   0.00    4.2    5.0    69.5   15
47    7,018   0.00    3.5    4.0    10.9   23
48    4,821   0.60    7.8    2.7   224.0   33
49    6,504   0.00   14.9    4.8     6.4   40
50    5,225   0.00   16.2    5.0    10.0   40
51    2,500   0.00   24.0   14.0    40.0   33
52    4,000   1.00   22.5   12.4    10.0   20
53    3,638   1.00    5.0    2.2    73.0    2
54    5,400   0.00   13.1    3.7    10.0   28
55    4,850   0.00   12.4    3.0    10.0   28
56    1,628   0.00   22.8   12.8    80.0   23
57    2,780   0.00   23.2   13.2    10.0   26
58    4,500   0.00   21.0   11.0     5.0   28
59    5,600   1.00   21.2   11.2     5.0   22
60    4,750   1.00   21.7   11.7     5.0   32
61    1,790   0.00   18.8    8.8   375.0   38
62    2,750   0.00   23.4   13.4    10.0   22
63    3,250   0.00   22.5   12.5    10.0   19
64    8,500   1.00   16.8    2.7     5.0   27
65    5,357   0.00   16.8    2.7     5.6   26
66    2,500   0.00   21.8    7.5    50.0   14
67    8,505   1.00   20.5    5.8     9.7   12

^a Price, observed land price per acre excluding improvements; WL, proportion
of acreage that is wooded; DA, distance from parcel to Sarasota airport; D75,
distance from parcel to I-75; A, acreage of parcel; MO, month in which the
parcel was sold.


Table 4.8 Data on Gasoline Consumption in the United States^a

Year     K       C       M     P_g     P_T   Pop.     L      Y
1947   9732   30.87   14.95    97.0    .538   145   59.3   1513
1948   9573   33.39   14.96   100.8    .564   147   60.6   1567
1949   9395   36.35   14.92   105.4    .633   150   61.3   1547
1950   9015   40.33   14.40   104.3    .678   152   62.2   1646
1951   9187   42.68   14.50    98.0    .694   155   62.0   1657
1952   9361   43.82   14.27    97.3    .723   158   62.1   1678
1953   9370   46.46   14.39   100.0    .765   160   63.0   1726
1954   9308   48.41   14.57   101.4    .814   163   63.6   1714
1955   9359   52.09   14.53   101.8    .840   166   65.0   1795
1956   9348   54.25   14.36   103.3    .860   170   66.6   1839
1957   9391   56.38   14.40   103.2    .862   172   66.9   1844
1958   9494   57.39   14.30    98.5    .879   175   67.6   1831
1959   9529   60.13   14.30    98.1    .897   178   68.4   1881
1960   9446   62.26   14.28    98.6    .913   181   69.6   1883
1961   9456   63.87   14.38    96.4    .944   184   70.5   1909
1962   9441   66.64   14.37    95.0    .965   187   70.6   1968
1963   9240   69.84   14.26    93.2    .965   189   71.8   2013
1964   9286   72.97   14.25    91.8    .970   192   73.1   2123
1965   9286   76.63   14.15    92.6    .972   194   74.5   2235
1966   9384   80.11   14.10    92.7    .979   196   75.8   2331
1967   9399   82.37   14.05    93.1   1.000   199   77.3   2398
1968   9488   85.79   13.91    90.8   1.004   201   78.7   2480
1969   9633   89.16   13.75    89.0   1.026   203   80.7   2517

^a K, miles traveled per car per year; C, number of cars (millions); M, miles
per gallon; P_g, retail gasoline price index deflated by consumer price index
(1953 = 100); P_T, price of public transport (1967 = 100); Pop., population
(millions); L, labor force (millions); Y, per capita disposable income in 1958
prices.


Table 4.9 Data on Demand for Food and Supply of Food in the United States^a

Year    Q_D     P_D      Y      Q_S     P_S    t
1922    98.6   100.2    87.4   108.5    99.1    1
1923   101.2   101.6    97.6   110.1    99.1    2
1924   102.4   100.5    96.7   110.4    98.9    3
1925   100.9   106.0    98.2   104.3   110.8    4
1926   102.3   108.7    99.8   107.2   108.2    5
1927   101.5   106.7   100.5   105.8   105.6    6
1928   101.6   106.7   103.2   107.8   109.8    7
1929   101.6   108.2   107.8   103.4   108.7    8
1930    99.8   105.5    96.6   102.7   100.6    9
1931   100.3    95.6    88.9   104.1    81.0   10
1932    97.6    88.6    75.1    99.2    68.6   11
1933    97.2    91.0    76.9    99.7    70.9   12
1934    97.3    97.9    84.6   102.0    81.4   13
1935    96.0   102.3    90.6    94.3   102.3   14
1936    99.2   102.2   103.1    97.7   105.0   15
1937   100.3   102.5   105.1   101.1   110.5   16
1938   100.3    97.0    96.4   102.3    92.5   17
1939   104.1    95.8   104.4   104.4    89.3   18
1940   105.3    96.4   110.7   108.5    93.0   19
1941   107.6   100.3   127.1   111.3   106.6   20

^a Q_D, food consumption per capita; Q_S, food production per capita; P_D, food
prices at retail level/cost of living index; Y, disposable income/cost of
living index; P_S, prices received by farmers for food/cost of living; t, time.

Source: M. A. Girshick and T. Haavelmo, "Statistical Analysis of the Demand for Food,"
Econometrica, April 1947.


Table 4.10 Data on Housing Starts in Canada (1954-4 to 1982-4 Quarterly)^a

  T       HS       y      RR   |    T       HS       y      RR
54-4    28.73   101.8   3.92   |  66-2    29.11   185.8    2.73
55-1    29.83   103.3   4.25   |    -3    29.83   191.0    2.44
  -2    31.22   107.4   4.42   |    -4    32.95   186.9    2.38
  -3    32.61   113.7   4.63   |  67-1    30.04   190.3    2.29
  -4    30.47   110.8   4.90   |    -2    43.53   193.5    2.14
56-1    31.34   117.2   4.63   |    -3    42.41   194.5    2.13
  -2    32.42   118.3   5.16   |    -4    35.87   194.0    2.39
  -3    30.56   116.8   4.64   |  68-1    46.34   197.4    2.83
  -4    22.01   121.7   4.09   |    -2    48.37   202.6    2.70
57-1    17.75   122.8   3.12   |    -3    42.49   208.0    2.40
  -2    29.41   122.5   2.79   |    -4    51.00   209.1    2.53
  -3    28.76   118.8   2.96   |  69-1    66.32   211.7    2.93
  -4    30.39   121.8   3.47   |    -2    54.51   213.2    3.42
58-1    36.67   121.5   3.71   |    -3    47.52   218.1    4.06
  -2    41.27   124.2   3.38   |    -4    41.58   218.0    4.09
  -3    36.18   123.2   3.04   |  70-1    40.85   218.3    4.25
  -4    37.46   127.7   3.07   |    -2    34.54   220.4    3.77
59-1    32.99   127.2   3.42   |    -3    44.43   222.4    3.29
  -2    32.49   128.8   4.52   |    -4    60.78   221.7    3.62
  -3    32.98   128.4   4.87   |  71-1    49.18   227.8    3.36
  -4    33.32   131.5   4.59   |    -2    55.81   233.3    3.76
60-1    21.95   133.4   4.41   |    -3    57.23   239.8    2.97
  -2    23.54   130.5   4.52   |    -4    59.27   242.0    2.02
  -3    27.05   133.3   4.54   |  72-1    62.12   243.8    1.93
  -4    27.34   133.8   4.55   |    -2    62.08   250.5    1.68
61-1    31.90   132.5   4.54   |    -3    60.51   249.2    1.60
  -2    29.88   137.1   4.36   |    -4    57.25   257.7    1.05
  -3    30.07   135.1   4.47   |  73-1    62.39   264.7    0.88
  -4    27.39   141.5   4.45   |    -2    68.25   267.2    1.65
62-1    30.58   143.3   4.34   |    -3    65.32   267.0    1.92
  -2    31.97   143.3   4.05   |    -4    62.45   278.1    1.46
  -3    31.53   148.4   4.38   |  74-1    68.48   280.6    1.21
  -4    28.36   148.2   3.96   |    -2    62.68   279.6    2.34
63-1    31.94   148.2   3.72   |    -3    50.43   274.5    2.50
  -2    32.78   150.1   3.63   |    -4    40.33   281.5    1.37
  -3    34.97   157.1   3.84   |  75-1    37.28   278.9   -0.96
  -4    39.12   157.7   3.85   |    -2    51.37   280.4   -1.23
64-1    42.82   160.7   3.75   |    -3    60.49   282.7   -0.09
  -2    32.84   162.2   3.69   |    -4    67.14   286.8    0.78
  -3    37.40   165.2   3.81   |  76-1    66.58   294.4    1.42
  -4    46.10   166.6   3.90   |    -2    70.57   298.9    1.87
65-1    40.63   169.9   3.68   |    -3    64.94   299.0    2.54
  -2    38.23   172.4   3.47   |    -4    63.81   299.0    2.88
  -3    39.63   177.0   3.34   |  77-1    53.48   301.3    2.36
  -4    40.38   179.0   3.28   |    -2    61.83   302.5    1.65
66-1    41.12   183.4   3.17   |    -3    61.79   304.9    0.71

Table 4.10 (Cont.)

  T       HS       y      RR   |    T       HS       y      RR
77-4    59.81   307.9   0.54   |  80-3    41.39   324.3    2.69
78-1    69.51   310.5   0.59   |    -4    40.57   332.6    4.97
  -2    50.31   312.7   1.62   |  81-1    39.14   332.0    7.53
  -3    55.22   318.4   2.28   |    -2    53.69   338.5    8.56
  -4    53.05   318.9   4.00   |    -3    47.06   338.3   11.37
79-1    46.19   322.5   4.28   |    -4    33.87   335.5    8.00
  -2    48.53   322.3   3.73   |  82-1    40.71   319.8    6.27
  -3    48.29   324.6   3.85   |    -2    28.61   320.9    7.41
  -4    49.24   328.1   5.85   |    -3    25.43   322.4    6.26
80-1    38.24   325.6   6.00   |    -4    32.20   316.6    3.12
  -2    35.33   321.3   5.26   |

^a T, year and quarter; HS, housing starts; y, gross national expenditures (in 1971 $); RR,
estimated real interest rate.

Source: R. Davidson and J. G. MacKinnon, "Testing Linear and Log-Linear Regressions
Against Box-Cox Alternatives," Canadian Journal of Economics, August 1985, Table 5, pp.
515-516. Data on HS and y have been rounded to four-digit numbers.


Table 4.11 Wages, Benefits, Unemployment, and Net National Product: United
Kingdom, 1920-1938

        Weekly    Weekly      Unemployment                NNP, Q^a
        Wages,    Benefits,   Rate,          Benefits/    (£ million at 1938
Year    W (s.)    B (s.)      U (%)          Wages        factor cost)
1920     73.8      11.3         3.9           0.15          3426
1921     70.6      16.83       17.0           0.24          3242
1922     59.1      22.00       14.3           0.37          3384
1923     55.5      22.00       11.7           0.40          3514
1924     56.0      23.67       10.3           0.42          3622
1925     56.4      27.00       11.3           0.48          3840
1926     55.8      27.00       12.5           0.48          3656
1927     56.2      27.00        9.7           0.48          3937
1928     55.7      27.67       10.8           0.50          4003
1929     55.8      28.00       10.4           0.50          4097
1930     55.7      29.50       16.1           0.53          4082
1931     54.9      29.54       21.3           0.54          3832
1932     54.0      27.25       22.1           0.50          3828
1933     53.7      27.25       19.9           0.51          3899
1934     54.3      28.6        16.7           0.53          4196
1935     55.0      30.3        15.5           0.55          4365
1936     56.1      32.00       13.1           0.57          4498
1937     57.2      32.00       10.8           0.56          4665
1938     58.9      32.75       12.9           0.56          4807

^a NNP, net national product; s., shilling = 1/20 £.
Heteroskedasticity


5.1 Introduction 

5.2 Detection of Heteroskedasticity 

5.3 Consequences of Heteroskedasticity 

5.4 Solutions to the Heteroskedasticity Problem 

5.5 Heteroskedasticity and the Use of Deflators 

*5.6 Testing the Linear Versus Log-Linear Functional Form 
Summary 
Exercises 

Appendix to Chapter 5 

5.1 Introduction 


One of the assumptions we have made until now is that the errors $u_i$ in the
regression equation have a common variance $\sigma^2$. This is known as the
homoskedasticity assumption. If the errors do not have a constant variance we say
they are heteroskedastic. There are several questions we might want to ask if
the errors do not have a constant variance. These are:

1. How do we detect this problem?

2. What are the consequences on the properties of the least squares
estimators, and what are the consequences on the estimated standard errors
if we use OLS?

3. What are the solutions to this problem?

We will answer these questions in the following sections. But first, we will
consider an example to illustrate the problem.


Illustrative Example 

Table 5.1 presents consumption expenditures ($y$) and income ($x$) for 20 families.
Suppose that we estimate the equation by ordinary least squares. We get

$$\hat y = \underset{(0.703)}{0.847} + \underset{(0.0253)}{0.899}\,x \qquad R^2 = 0.986 \qquad \text{RSS} = 31.074$$

(Figures in parentheses are standard errors.) We can compute the predicted
values and the residuals. In Table 5.2 we present the residuals for the
observations, ordered by their x-values. As we can easily see, they are
larger (in absolute value) for larger values of x. Thus there is some evidence that the
error variances are not constant but increase with the value of x. In Figure 5.1
we show the plot of the residuals. This shows graphically (perhaps more clearly than
Table 5.2) that there is a heteroskedasticity problem.
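The fitted equation and the residuals can be reproduced directly from the data in Table 5.1. The following is a minimal sketch using NumPy; the variable names are illustrative choices, and the data are typed in from the table in family order.

```python
import numpy as np

# Table 5.1: income (x) and consumption (y) for families 1..20, thousands of dollars.
x = np.array([22.3, 32.3, 36.6, 12.1, 42.3, 6.2, 44.7, 26.1, 10.3, 40.2,
              8.1, 34.5, 38.0, 14.1, 16.4, 24.1, 30.1, 28.3, 18.2, 20.1])
y = np.array([19.9, 31.2, 31.8, 12.1, 40.7, 6.1, 38.6, 25.5, 10.3, 38.8,
              8.0, 33.1, 33.5, 13.1, 14.8, 21.6, 29.3, 25.0, 17.9, 19.8])

# OLS with a constant term.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef
resid = y - X @ coef
rss = float(resid @ resid)

print(f"y = {intercept:.3f} + {slope:.3f} x,  RSS = {rss:.3f}")  # text reports 0.847, 0.899, 31.074

# Residuals ordered by x, as in Table 5.2 (observation number, x, residual).
order = np.argsort(x)
for obs, xi, ri in zip(order + 1, x[order], resid[order]):
    print(f"{obs:2d}  {xi:5.1f}  {ri:6.2f}")
```

Listing the residuals in order of x makes the widening spread visible without a plot.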

Sometimes, the heteroskedasticity problem is solved by estimating the
regression in a log-linear form. When we regress log y on log x, the estimated
equation is

$$\widehat{\log y} = \underset{(0.0574)}{0.0757} + \underset{(0.0183)}{0.9562}\,\log x \qquad R^2 = 0.9935 \qquad \text{RSS} = 0.03757$$

(Figures in parentheses are standard errors.) The $R^2$'s are not comparable since
the variance of the dependent variable is different. We discuss the problem of
comparing $R^2$'s from linear versus log-linear form in Section 5.6. The residuals
from this equation are presented in Table 5.3. In this situation there is no
perceptible increase in the magnitudes of the residuals as the value of x increases.
Thus there does not appear to be a heteroskedasticity problem.

In any case, we need formal tests of the hypothesis of homoskedasticity.
These are discussed in the following sections.

Figure 5.1. Example of heteroskedasticity.

Table 5.1 Consumption Expenditures (y) and Income (x) for 20 Families
(Thousands of Dollars)

Family     y      x    |  Family     y      x
   1     19.9   22.3   |    11      8.0    8.1
   2     31.2   32.3   |    12     33.1   34.5
   3     31.8   36.6   |    13     33.5   38.0
   4     12.1   12.1   |    14     13.1   14.1
   5     40.7   42.3   |    15     14.8   16.4
   6      6.1    6.2   |    16     21.6   24.1
   7     38.6   44.7   |    17     29.3   30.1
   8     25.5   26.1   |    18     25.0   28.3
   9     10.3   10.3   |    19     17.9   18.2
  10     38.8   40.2   |    20     19.8   20.1

Table 5.2 Residuals for the Consumption Function Estimated from Data in Table 5.1
(Ordered by Their x-Values)^a

Observation   Value of x   Residual   |  Observation   Value of x   Residual
     6            6.2        -0.32    |       8           26.1         1.18
    11            8.1        -0.13    |      18           28.3        -1.30
     9           10.3         0.19    |      17           30.1         1.38
     4           12.1         0.37    |       2           32.3         1.30
    14           14.1        -0.43    |      12           34.5         1.23
    15           16.4        -0.80    |       3           36.6        -1.96
    19           18.2         0.69    |      13           38.0        -1.52
    20           20.1         0.88    |      10           40.2         1.80
     1           22.3        -1.00    |       5           42.3         1.81
    16           24.1        -0.92    |       7           44.7        -2.45

^a Residuals are rounded to two decimals.

Table 5.3 Residuals from the Log-Linear Equation^a

Observation   log x   Residual   |  Observation   log x   Residual
     6         1.82     -0.12    |       8         3.26      0.44
    11         2.09      0.04    |      18         3.34     -0.53
     9         2.33      0.27    |      17         3.40      0.47
     4         2.49      0.34    |       2         3.48      0.42
    14         2.65     -0.33    |      12         3.54      0.38
    15         2.80     -0.56    |       3         3.60     -0.59
    19         2.90      0.35    |      13         3.64     -0.42
    20         3.00      0.41    |      10         3.69      0.51
     1         3.10     -0.54    |       5         3.74      0.50
    16         3.18     -0.46    |       7         3.80     -0.56

^a Log x rounded to two decimals and residual multiplied by 10.

5.2 Detection of Heteroskedasticity

In the illustrative example in Section 5.1 we plotted the estimated residual $\hat u_i$
against $x_i$ to see whether we notice any systematic pattern in the residuals that
suggests heteroskedasticity in the errors. Note, however, that by virtue of the
normal equation, $\hat u_i$ and $x_i$ are uncorrelated, though $\hat u_i^2$ could be correlated with
$x_i$. Thus if we are using a regression procedure to test for heteroskedasticity,
we should use a regression of $\hat u_i$ on $x_i^2, x_i^3, \ldots$ or a regression of $|\hat u_i|$ or
$\hat u_i^2$ on $x_i, x_i^2, x_i^3, \ldots$. In the case of multiple regression, we should use powers
of $\hat y_i$, the predicted value of $y_i$, or powers of all the explanatory variables.

1. The test suggested by Anscombe^1 and a test called RESET suggested by
Ramsey^2 both involve regressing $\hat u_i$ on $\hat y_i^2, \hat y_i^3, \ldots$ and testing whether or
not the coefficients are significant.

2. The test suggested by White^3 involves regressing $\hat u_i^2$ on all the
explanatory variables and their squares and cross products. For instance,
with three explanatory variables $x_1, x_2, x_3$, it involves regressing $\hat u_i^2$ on $x_1$,
$x_2$, $x_3$, $x_1^2$, $x_2^2$, $x_3^2$, $x_1x_2$, $x_2x_3$, and $x_3x_1$.

3. Glejser^4 suggested estimating regressions of the type $|\hat u_i| = \alpha + \beta x_i$,
$|\hat u_i| = \alpha + \beta/x_i$, $|\hat u_i| = \alpha + \beta\sqrt{x_i}$, and so on, and testing the hypothesis
$\beta = 0$.

The implicit assumption behind all these tests is that $\operatorname{var}(u_i) = \sigma_i^2 = \sigma^2 f(z_i)$,
where $z_i$ is an unknown variable and the different tests use different proxies or
surrogates for the unknown function $f(z)$.


1 F. J. Anscombe, "Examination of Residuals," Proceedings of the Fourth Berkeley Symposium
on Mathematical Statistics and Probability (Berkeley, Calif.: University of California Press,
1961), pp. 1-36.

2 J. B. Ramsey, "Tests for Specification Errors in Classical Linear Least Squares Regression
Analysis," Journal of the Royal Statistical Society, Series B, Vol. 31, 1969, pp. 350-371.

3 H. White, "A Heteroskedasticity Consistent Covariance Matrix Estimator and a Direct Test
of Heteroskedasticity," Econometrica, Vol. 48, 1980, pp. 817-838.

4 H. Glejser, "A New Test for Homoscedasticity," Journal of the American Statistical
Association, 1969, pp. 316-323.





Illustrative Example 

For the data in Table 5.1, using the residuals in Table 5.2, the Ramsey, White,
and Glejser tests produced the following results. Since there is only one
explanatory variable x, we can use x instead of $\hat y$ in the application of the Ramsey
test. The test thus involves regressing $\hat u_i$ on $x_i^2$, $x_i^3$, and so on. In our example
the test was not very useful. The results were (standard errors are not reported
because the $R^2$ is very low for sample size 20):

$$\hat u = -0.379 + 0.236 \times 10^{-2}\,x^2 - 0.549 \times 10^{-4}\,x^3 \qquad R^2 = 0.034$$

None of the coefficients had a t-ratio > 1, indicating that we are unable to reject
the hypothesis that the errors are homoskedastic.

The test suggested by White involves regressing $\hat u^2$ on $x$, $x^2$, $x^3$, and so on.
The results were:

$$\hat u^2 = \underset{(0.390)}{-1.370} + \underset{(0.014)}{0.116}\,x \qquad R^2 = 0.7911$$

$$\hat u^2 = \underset{(0.620)}{0.493} - \underset{(0.055)}{0.071}\,x + \underset{(0.0011)}{0.0037}\,x^2 \qquad R^2 = 0.878$$

The $R^2$'s are highly significant in both cases. Thus the test rejects the
hypothesis of homoskedasticity. A suggested procedure to correct for
heteroskedasticity is to estimate the regression model assuming that $V(u_i) = \sigma_i^2$,
where $\sigma_i^2 = \gamma_0 + \gamma_1 x + \gamma_2 x^2$. This procedure is discussed in Section 5.4.

Glejser's tests gave the following results:

$$|\hat u| = \underset{(0.094)}{-0.209} + \underset{(0.0034)}{0.0512}\,x \qquad R^2 = 0.927$$

$$|\hat u| = \underset{(0.186)}{-1.232} + \underset{(0.037)}{0.475}\,\sqrt{x} \qquad R^2 = 0.902$$

$$|\hat u| = \underset{(0.155)}{1.826} - \underset{(2.39)}{13.78}\,(1/x) \qquad R^2 = 0.649$$

All the tests reject the hypothesis of homoskedasticity, although on the basis
of $R^2$, the first model is preferable to the others. The suggested model to
estimate is the same as that suggested by White's test.

The results are similar for the log-linear form as well, although the
coefficients are not as significant. Using the residuals in Table 5.3 we get, for the
White test,

$$\hat u^2 = \underset{(0.083)}{-0.211} + \underset{(0.026)}{0.129}\,x \qquad R^2 = 0.572$$

$$\hat u^2 = \underset{(0.385)}{-0.620} + \underset{(0.273)}{0.425}\,x - \underset{(0.047)}{0.051}\,x^2 \qquad R^2 = 0.600$$

Thus there is evidence of heteroskedasticity even in the log-linear form,
although casually looking at the residuals in Table 5.3, we concluded earlier that
the errors were homoskedastic. The Goldfeld-Quandt test, to be discussed
later in this section, also did not reject the hypothesis of homoskedasticity. The
Glejser tests, however, show significant heteroskedasticity in the log-linear
form.
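The auxiliary regressions behind the White and Glejser tests for the linear form can be sketched as follows. This is an illustrative NumPy sketch, not the procedure used to produce the figures quoted above: the helper `ols` and the variable names are assumptions, and because the residuals are recomputed at full precision the output may differ slightly from the reported values.

```python
import numpy as np

# Table 5.1 data, families 1..20.
x = np.array([22.3, 32.3, 36.6, 12.1, 42.3, 6.2, 44.7, 26.1, 10.3, 40.2,
              8.1, 34.5, 38.0, 14.1, 16.4, 24.1, 30.1, 28.3, 18.2, 20.1])
y = np.array([19.9, 31.2, 31.8, 12.1, 40.7, 6.1, 38.6, 25.5, 10.3, 38.8,
              8.0, 33.1, 33.5, 13.1, 14.8, 21.6, 29.3, 25.0, 17.9, 19.8])

def ols(regressors, dep):
    """Regress dep on a constant plus the given regressors; return (coef, R^2)."""
    X = np.column_stack([np.ones(len(dep))] + list(regressors))
    coef, *_ = np.linalg.lstsq(X, dep, rcond=None)
    resid = dep - X @ coef
    tss = np.sum((dep - dep.mean()) ** 2)
    return coef, 1.0 - float(resid @ resid) / tss

# Residuals from the original regression of y on x.
coef0, _ = ols([x], y)
u = y - (coef0[0] + coef0[1] * x)

# White-type regressions: u^2 on x, and on x and x^2.
white_lin_coef, white_lin_r2 = ols([x], u ** 2)
white_quad_coef, white_quad_r2 = ols([x, x ** 2], u ** 2)

# Glejser-type regressions: |u| on x, sqrt(x), and 1/x.
glejser = {name: ols([g], np.abs(u))
           for name, g in {"x": x, "sqrt(x)": np.sqrt(x), "1/x": 1.0 / x}.items()}

print("White, linear:   ", white_lin_coef, white_lin_r2)
print("White, quadratic:", white_quad_coef, white_quad_r2)
for name, (c, r2) in glejser.items():
    print(f"Glejser on {name}:", c, r2)
```

Each regression is judged by the significance of its slope coefficients (or of its $R^2$); a clearly positive slope of $\hat u^2$ or $|\hat u|$ on x points to error variances that grow with x.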


Some Other Tests 

The Likelihood Ratio Test 

If the number of observations is large, one can use a likelihood ratio test.
Divide the residuals (estimated from the OLS regression) into k groups with $n_i$
observations in the ith group, $\sum n_i = n$. Estimate the error variances in each
group by $\hat\sigma_i^2$. Let the estimate of the error variance from the entire sample be
$\hat\sigma^2$. Then if we define $\lambda$ as

$$\lambda = \prod_{i=1}^{k} \left( \frac{\hat\sigma_i}{\hat\sigma} \right)^{n_i}$$

$-2 \log_e \lambda$ has a $\chi^2$-distribution with degrees of freedom (k - 1). If there is only
one explanatory variable in the equation, the ordering of the residuals can be
based on the absolute magnitude of this variable. But if there are two or more
variables and no single variable can provide a satisfactory ordering, then $\hat y$, the
predicted value of y, can be used.

Feldstein used this LR test for his hospital cost regressions described in
Chapter 4 (Section 4.6, Example 1). He divided the total number of
observations (177) into four groups of equal size, the residuals being ordered by the
predicted values of the dependent variable. The estimates of $\hat\sigma_i^2$ were 71.47,
114.82, 102.81, and 239.34. The estimate $\hat\sigma^2$ for the whole sample was 138.76.
Thus $-2 \log \lambda$ was 18.265. The 1% significance point for the $\chi^2$-distribution with
3 d.f. is 11.34. Thus there were significant differences between the error
variances. Next Feldstein weighted the observations by weights proportional to
$1/\hat\sigma_i$. The weights normalized to make their average equal to 1 were 1.2599,
0.9940, 1.0504, and 0.6885. This would make the error variances approximately
equal. The equation was estimated by OLS using the transformed data. This
procedure is called weighted least squares and is often denoted by WLS. The
new estimates of the variances from this reestimated equation for the four
groups were 106.34, 110.71, 114.99, and 117.06, which are almost equal.
However, the regression parameters did not change much, as shown in Table 5.4.
Although the point estimates did not change much, the standard errors could
be different. Since Feldstein does not present these, we have no way of
comparing them.

Table 5.4 Comparison of OLS and WLS Estimates for Hospital-Cost Regression

                                      Average Cost per Case
Case Type                               OLS        WLS
General medicine                      114.48     111.81
Pediatrics                             24.97      28.35
General surgery                        32.70      35.07
ENT                                    15.25      15.58
Traumatic and orthopedic surgery       39.69      36.04
Other surgery                          98.02     101.38
Gynecology                             58.72      58.48
Obstetrics                             34.88      34.50
Others                                 69.51      66.26

Source: M. S. Feldstein, Economic Analysis for Health Service Efficiency (Amsterdam:
North-Holland, 1967), p. 54.

Goldfeld and Quandt Test

If we do not have large samples, we can use the Goldfeld and Quandt test.^5 In
this test we split the observations into two groups (one corresponding to large
values of x and the other corresponding to small values of x), fit separate
regressions for each, and then apply an F-test to test the equality of error
variances. Goldfeld and Quandt suggest omitting some observations in the middle
to increase our ability to discriminate between the two error variances.

5 S. M. Goldfeld and R. E. Quandt, Nonlinear Methods in Econometrics (Amsterdam:
North-Holland, 1972), Chap. 3.

Breusch and Pagan Test 

Suppose that $V(u_i) = \sigma_i^2$. If there are some variables $z_1, z_2, \ldots, z_r$ that
influence the error variance and if $\sigma_i^2 = f(\alpha_0 + \alpha_1 z_{1i} + \alpha_2 z_{2i} + \cdots + \alpha_r z_{ri})$, then
the Breusch and Pagan test^6 is a test of the hypothesis

$$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_r = 0$$

The function $f(\cdot)$ can be any function. For instance, $f(x)$ can be $x$, $x^2$, $e^x$, and
so on. The Breusch and Pagan test does not depend on the functional form.
Let

$$\hat\sigma^2 = \frac{\sum \hat u_i^2}{n}$$

$$S_0 = \text{regression sum of squares from a regression of } \hat u_i^2 \text{ on } z_1, z_2, \ldots, z_r$$

Then $\lambda = S_0 / 2\hat\sigma^4$ has a $\chi^2$ distribution with degrees of freedom r.

This test is an asymptotic test. An intuitive justification for the test will be
given after an illustrative example.



Illustrative Example 

Consider the data in Table 5.1. To apply the Goldfeld-Quandt test we consider
two groups of 10 observations each, ordered by the values of the variable x.
The first group consists of observations 6, 11, 9, 4, 14, 15, 19, 20, 1, and 16.
The second group consists of the remaining 10. The estimated equations were:

Group 1:

$$\hat y = \underset{(0.616)}{1.0533} + \underset{(0.038)}{0.876}\,x \qquad R^2 = 0.985 \qquad \hat\sigma^2 = 0.475$$

Group 2:

$$\hat y = \underset{(3.443)}{3.279} + \underset{(0.096)}{0.835}\,x \qquad R^2 = 0.904 \qquad \hat\sigma^2 = 3.154$$

The F-ratio for the test is

$$F = \frac{3.154}{0.475} = 6.64$$

The 1% point for the F-distribution with d.f. 8 and 8 is 6.03. Thus the F-value
is significant at the 1% level and we reject the hypothesis of homoskedasticity.
Consider now the logarithmic form. The results are:

Group 1:

$$\widehat{\log y} = \underset{(0.079)}{0.128} + \underset{(0.030)}{0.934}\,x \qquad R^2 = 0.992 \qquad \hat\sigma^2 = 0.001596$$

Group 2:

$$\widehat{\log y} = \underset{(0.352)}{0.276} + \underset{(0.099)}{0.902}\,x \qquad R^2 = 0.912 \qquad \hat\sigma^2 = 0.002789$$

The F-ratio for the test is

$$F = \frac{0.002789}{0.001596} = 1.75$$

For d.f. 8 and 8, the 5% point from the F-tables is 3.44. Thus if we use the
5% significance level, we do not reject the hypothesis of homoskedasticity.
Hence by the Goldfeld-Quandt test we reject the hypothesis of
homoskedasticity if we consider the linear form but do not reject it in the log-linear form.
Note that the White test rejected the hypothesis in both the forms.

6 T. S. Breusch and A. R. Pagan, "A Simple Test for Heteroscedasticity and Random Coefficient
Variation," Econometrica, Vol. 47, 1979, pp. 1287-1294.
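The Goldfeld-Quandt computation for the linear form can be reproduced from Table 5.1. The sketch below is illustrative: the helper name `sigma2_hat` is an assumption, no middle observations are omitted (as in the example above), and the critical value 6.03 is the tabulated 1% point quoted in the text rather than something the code computes.

```python
import numpy as np

# Table 5.1 data, families 1..20.
x = np.array([22.3, 32.3, 36.6, 12.1, 42.3, 6.2, 44.7, 26.1, 10.3, 40.2,
              8.1, 34.5, 38.0, 14.1, 16.4, 24.1, 30.1, 28.3, 18.2, 20.1])
y = np.array([19.9, 31.2, 31.8, 12.1, 40.7, 6.1, 38.6, 25.5, 10.3, 38.8,
              8.0, 33.1, 33.5, 13.1, 14.8, 21.6, 29.3, 25.0, 17.9, 19.8])

def sigma2_hat(xg, yg):
    """Error-variance estimate RSS / (n - 2) from a regression of yg on a constant and xg."""
    X = np.column_stack([np.ones_like(xg), xg])
    coef, *_ = np.linalg.lstsq(X, yg, rcond=None)
    resid = yg - X @ coef
    return float(resid @ resid) / (len(yg) - 2)

# Order by x and split into the 10 smallest and the 10 largest x-values.
order = np.argsort(x)
low, high = order[:10], order[10:]

s2_low = sigma2_hat(x[low], y[low])      # text reports 0.475
s2_high = sigma2_hat(x[high], y[high])   # text reports 3.154
F = s2_high / s2_low                     # text reports 6.64

F_CRIT_1PCT = 6.03   # 1% point of F(8, 8), as quoted in the text
print(f"sigma2 (low x) = {s2_low:.3f}, sigma2 (high x) = {s2_high:.3f}, F = {F:.2f}")
print("reject homoskedasticity at 1%:", F > F_CRIT_1PCT)
```

Placing the larger variance in the numerator gives a one-sided test against the alternative that the error variance increases with x.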

Turning now to the Breusch and Pagan test, the regression of $\hat u_i^2$
on $x_i$, $x_i^2$, and $x_i^3$ gave the following regression sums of squares. For the linear
form

$$S = 40.842 \text{ for the regression of } \hat u_i^2 \text{ on } x_i, x_i^2, x_i^3$$

$$S = 40.065 \text{ for the regression of } \hat u_i^2 \text{ on } x_i, x_i^2$$

Also, $\hat\sigma^2 = 1.726$. The test statistic for the $\chi^2$ test is (using the second
regression)

$$\frac{S}{2\hat\sigma^4} = \frac{40.065}{2(2.979)} = 6.724$$

We use the statistic as a $\chi^2$ with d.f. = 2 since two slope parameters are
estimated. This is significant at the 5 percent level, thus rejecting the hypothesis
of homoskedasticity.

For the log-linear form, using only $x_i$ and $x_i^2$ as regressors we get: S =
0.000011 and $\hat\sigma^2 = 0.00209$. The test statistic is

$$\frac{S}{2\hat\sigma^4} = \frac{0.000011}{2(0.00209)^2} = 1.259$$

Using the $\chi^2$ tables with 2 d.f. we see that this is not significant at even the 50
percent level. Thus, the test does not reject the hypothesis of homoskedasticity
in the log-linear form.
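The Breusch-Pagan statistic for the linear form can be computed from Table 5.1 along the following lines. This is a sketch under stated assumptions: the variable names are illustrative, and the divisor used for $\hat\sigma^2$ is shown both ways, since the value 1.726 quoted in the example corresponds to RSS/(n - 2) rather than RSS/n.

```python
import numpy as np

# Table 5.1 data, families 1..20.
x = np.array([22.3, 32.3, 36.6, 12.1, 42.3, 6.2, 44.7, 26.1, 10.3, 40.2,
              8.1, 34.5, 38.0, 14.1, 16.4, 24.1, 30.1, 28.3, 18.2, 20.1])
y = np.array([19.9, 31.2, 31.8, 12.1, 40.7, 6.1, 38.6, 25.5, 10.3, 38.8,
              8.0, 33.1, 33.5, 13.1, 14.8, 21.6, 29.3, 25.0, 17.9, 19.8])
n = len(y)

# Squared OLS residuals from the regression of y on a constant and x.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
u2 = (y - X @ coef) ** 2

# Auxiliary regression of u^2 on a constant, x, and x^2.
Z = np.column_stack([np.ones(n), x, x ** 2])
gamma, *_ = np.linalg.lstsq(Z, u2, rcond=None)
S = float(np.sum((Z @ gamma - u2.mean()) ** 2))   # regression sum of squares

sigma2_n = u2.sum() / n          # divisor n
sigma2_df = u2.sum() / (n - 2)   # divisor n - 2; about 1.726, as quoted in the text

stat_n = S / (2 * sigma2_n ** 2)
stat_df = S / (2 * sigma2_df ** 2)   # comparable to the 6.724 in the text

print(f"S = {S:.3f}")
print(f"sigma2 = {sigma2_df:.3f} (n - 2) -> statistic {stat_df:.3f}")
print(f"sigma2 = {sigma2_n:.3f} (n)     -> statistic {stat_n:.3f}")
```

Either version is compared with the $\chi^2$ distribution with 2 d.f.; the smaller divisor gives a smaller statistic, so it is the more conservative of the two.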


An Intuitive Justification for the Breusch-Pagan Test

In Section 4.12 we discussed the LM, W, and LR tests. There we showed that
the LM test statistic was LM = $nR^2$, which has a $\chi^2$-distribution. Arguing from
analogy, we would get a test statistic $nR^2$, where $R^2$ is the squared multiple correlation
coefficient in a regression of $\hat u_i^2$ on $z_{1i}, z_{2i}, z_{3i}, \ldots$. To see the relationship
between this and the Breusch-Pagan test statistic, note that

$$nR^2 = \frac{\text{Regr. SS from a regression of } \hat u_i^2 \text{ on } z_{1i}, z_{2i}, z_{3i}, \ldots}{\operatorname{var}(\hat u_i^2)} = \frac{S}{\operatorname{var}(\hat u_i^2)}$$

Now, under the null hypothesis that the errors are homoskedastic, $u_i^2/\sigma^2$ has a
$\chi^2$-distribution with 1 d.f. Hence $\operatorname{var}(u_i^2/\sigma^2) = 2$ (because the variance of a $\chi^2$
variable is twice the degrees of freedom). Thus

$$\operatorname{var}(u_i^2) = 2\sigma^4$$

In large samples we can write $\operatorname{var}(\hat u_i^2) \simeq \operatorname{var}(u_i^2)$ and $\hat\sigma^4 \simeq \sigma^4$. Hence we get
$\operatorname{var}(\hat u_i^2) \simeq 2\hat\sigma^4$. Thus the test statistic is $S/2\hat\sigma^4$, as described earlier. The statistic
can also be viewed as half the regression sum of squares from a regression of
$g_i = \hat u_i^2/\hat\sigma^2$ on $z_{1i}, z_{2i}, z_{3i}, \ldots$. Breusch and Pagan argue that in discussions of
heteroskedasticity, if one is going to plot any quantity, it is more reasonable to
plot $g_i$ than quantities like $\hat u_i$.


5.3 Consequences of Heteroskedasticity 


Before we attempt solutions to the heteroskedasticity problem, we will study
the consequences on the least squares estimators. We will show that

1. The least squares estimators are still unbiased but inefficient.

2. The estimates of the variances are also biased, thus invalidating the tests
of significance.

To see this, consider a very simple model with no constant term.

$$y_i = \beta x_i + u_i \qquad V(u_i) = \sigma_i^2 \tag{5.1}$$

The least squares estimator of $\beta$ is

$$\hat\beta = \frac{\sum x_i y_i}{\sum x_i^2} = \beta + \frac{\sum x_i u_i}{\sum x_i^2}$$

If $E(u_i) = 0$ and $u_i$ are independent of the $x_i$, we have $E(\sum x_i u_i / \sum x_i^2) = 0$ and
hence $E(\hat\beta) = \beta$. Thus $\hat\beta$ is unbiased.

If the $u_i$ are mutually independent, denoting $\sum x_i^2$ by $S_{xx}$ we can write

$$V(\hat\beta) = V\left( \frac{x_1}{S_{xx}} u_1 + \frac{x_2}{S_{xx}} u_2 + \cdots + \frac{x_n}{S_{xx}} u_n \right) = \frac{1}{S_{xx}^2}\left( x_1^2\sigma_1^2 + x_2^2\sigma_2^2 + \cdots + x_n^2\sigma_n^2 \right) = \frac{\sum x_i^2\sigma_i^2}{\left(\sum x_i^2\right)^2} \tag{5.2}$$

Suppose that we write $\sigma_i^2 = \sigma^2 z_i^2$, where $z_i$ are known, that is, we know the
variances up to a multiplicative constant. Then dividing (5.1) by $z_i$ we have the
model

$$\frac{y_i}{z_i} = \beta\,\frac{x_i}{z_i} + v_i \tag{5.3}$$

where $v_i = u_i/z_i$ has a constant variance $\sigma^2$. Since we are "weighting" the ith
observation by $1/z_i$, the OLS estimation of (5.3) is called weighted least squares
(WLS). If $\beta^*$ is the WLS estimator of $\beta$, we have

$$\beta^* = \frac{\sum (y_i/z_i)(x_i/z_i)}{\sum (x_i/z_i)^2} = \beta + \frac{\sum (x_i/z_i)\,v_i}{\sum (x_i/z_i)^2}$$

and since the latter term has expectation zero, we have $E(\beta^*) = \beta$. Thus the
WLS estimator $\beta^*$ is also unbiased. We will show that $\beta^*$ is more efficient than
the OLS estimator $\hat\beta$.

We have

$$V(\beta^*) = \frac{\sigma^2}{\sum (x_i/z_i)^2}$$

and substituting $\sigma_i^2 = \sigma^2 z_i^2$ in (5.2), we have

$$V(\hat\beta) = \sigma^2\,\frac{\sum x_i^2 z_i^2}{\left(\sum x_i^2\right)^2}$$

Thus

$$\frac{V(\beta^*)}{V(\hat\beta)} = \frac{\left(\sum x_i^2\right)^2}{\sum (x_i^2/z_i^2)\,\sum x_i^2 z_i^2}$$

This expression is of the form $\left(\sum a_i b_i\right)^2 / \left(\sum a_i^2 \sum b_i^2\right)$, where $a_i = x_i z_i$ and $b_i =
x_i/z_i$. Thus it is less than 1 and is equal to 1 only if $a_i$ and $b_i$ are proportional,
that is, $x_i z_i$ and $x_i/z_i$ are proportional or $z_i^2$ is a constant, which is the case if the
errors are homoskedastic.

Thus the OLS estimator is unbiased but less efficient (has a higher variance)
than the WLS estimator. Turning now to the estimation of the variance of $\hat{\beta}$, it
is estimated by

$$\frac{\mathrm{RSS}}{n - 1} \cdot \frac{1}{\sum x_i^2}$$

where RSS is the residual sum of squares from the OLS model. But

$$E(\mathrm{RSS}) = E\left[\sum (y_i - \hat{\beta} x_i)^2\right] = \sum \sigma_i^2 - \frac{\sum x_i^2 \sigma_i^2}{\sum x_i^2}$$

Note that if $\sigma_i^2 = \sigma^2$ for all $i$, this reduces to $(n - 1)\sigma^2$. Thus we would be
estimating the variance of $\hat{\beta}$ by an expression whose expected value is

$$\frac{\sum x_i^2 \sum \sigma_i^2 - \sum x_i^2 \sigma_i^2}{(n - 1)\left(\sum x_i^2\right)^2}$$

whereas the true variance is

$$\frac{\sum x_i^2 \sigma_i^2}{\left(\sum x_i^2\right)^2}$$

Thus the estimated variances are also biased. If $\sigma_i^2$ and $x_i^2$ are positively
correlated, as is often the case with economic data so that $\sum x_i^2 \sigma_i^2 > (1/n) \sum x_i^2 \sum \sigma_i^2$,
then the expected value of the estimated variance is smaller
than the true variance. Thus we would be underestimating the true variance of
the OLS estimator and getting shorter confidence intervals than the true ones.
This also affects tests of hypotheses about $\beta$.
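The three variance expressions derived above can be evaluated directly for a given set of regressor values. The following is a minimal sketch, assuming the model without a constant term and $\sigma_i^2 = \sigma^2 z_i^2$; the function name `variance_summary` and the regressor values are illustrative assumptions, not part of the text.

```python
import numpy as np

def variance_summary(x, z, sigma2=1.0):
    """Evaluate the section's variance formulas when sigma_i^2 = sigma2 * z_i^2."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = x.size
    s_xx = np.sum(x**2)
    var_i = sigma2 * z**2                       # sigma_i^2
    v_ols = np.sum(x**2 * var_i) / s_xx**2      # true V(beta_hat), equation (5.2)
    v_wls = sigma2 / np.sum((x / z) ** 2)       # V(beta_star)
    # E(RSS) / ((n - 1) * S_xx): expected value of the usual OLS variance estimate
    e_rss = np.sum(var_i) - np.sum(x**2 * var_i) / s_xx
    v_ols_reported = e_rss / ((n - 1) * s_xx)
    return v_ols, v_wls, v_ols_reported

x = np.arange(1.0, 11.0)                        # hypothetical regressor values 1, ..., 10
v_ols, v_wls, v_reported = variance_summary(x, z=x)   # error s.d. proportional to x
print(v_wls / v_ols, v_reported / v_ols)        # both ratios are below 1 here
```

With the error standard deviation proportional to $x_i$, both ratios fall below 1: WLS is more efficient, and the conventional formula understates the true OLS variance. Setting `z` to a vector of ones makes all three quantities coincide.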


Estimation of the Variance of the OLS Estimator 
Under Heteroskedasticity 

The solution to the heteroskedasticity problem depends on the assumptions we 
make about the sources of heteroskedasticity. When we are not sure of this, 
we can at least try to make corrections for the standard errors, since we have 
seen that the least squares estimator is unbiased but inefficient, and moreover, 
the standard errors are also biased. 
White⁷ suggests that we use the formula (5.2) with $\hat{u}_i^2$ substituted for $\sigma_i^2$. Using
this formula we find that in the case of the illustrative example with data in
Table 5.1, the standard error of $\hat{\beta}$, the slope coefficient, is 0.027. Earlier, we
estimated it from the OLS regression as 0.0253. Thus the difference is really
not very large in this example.
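As a rough sketch of this correction for the simple model without a constant term, the code below computes the conventional standard error and the one obtained from (5.2) with squared residuals in place of $\sigma_i^2$. The function name `ols_and_white_se` and the simulated data are assumptions for illustration; they are not the Table 5.1 data, and the Table 5.1 regression also includes a constant term.

```python
import numpy as np

def ols_and_white_se(x, y):
    """Slope, conventional SE, and White-type SE for y_i = beta * x_i + u_i (no constant)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_xx = np.sum(x**2)
    beta_hat = np.sum(x * y) / s_xx
    resid = y - beta_hat * x
    se_conventional = np.sqrt(np.sum(resid**2) / (x.size - 1) / s_xx)
    # Formula (5.2) with squared residuals in place of the unknown sigma_i^2
    se_white = np.sqrt(np.sum(x**2 * resid**2)) / s_xx
    return beta_hat, se_conventional, se_white

# Hypothetical data (not Table 5.1): the error spread grows with x
rng = np.random.default_rng(0)
x = np.linspace(1.0, 20.0, 200)
y = 0.9 * x + rng.normal(scale=0.2 * x)
beta_hat, se_conv, se_white = ols_and_white_se(x, y)
print(beta_hat, se_conv, se_white)
```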

5.4 Solutions to the Heteroskedasticity Problem


There are two types of solutions that have been suggested in the literature for 
the problem of heteroskedasticity: 

1. Solutions dependent on particular assumptions about $\sigma_i^2$.

2. General solutions. 

We first discuss category 1. Here we have two methods of estimation: 
weighted least squares and maximum likelihood (ML). 

If the error variances are known up to a multiplicative constant, there is no
problem at all. If $V(u_i) = \sigma^2 z_i^2$, where the $z_i$ are known, we divide the equation
through by $z_i$ and use ordinary least squares. The only thing to remember is
that if the original equation contained a constant term, that is, $y_i = \alpha + \beta x_i + u_i$,
the transformed equation will not have a constant term. It is

$$\frac{y_i}{z_i} = \alpha \frac{1}{z_i} + \beta \frac{x_i}{z_i} + v_i$$

where $v_i = u_i/z_i$. Now $V(v_i) = \sigma^2$ for all $i$. Thus we should be running a regression
of $y_i/z_i$ on $1/z_i$ and $x_i/z_i$ without a constant term.

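One interesting case is where $V(u_i) = \sigma^2 x_i$ and $\alpha = 0$. In this case the transformed
equation is

$$\frac{y_i}{\sqrt{x_i}} = \beta \sqrt{x_i} + v_i$$

Hence

$$\beta^* = \frac{\sum (y_i/\sqrt{x_i}) \sqrt{x_i}}{\sum (\sqrt{x_i})^2} = \frac{\sum y_i}{\sum x_i} = \frac{\bar{y}}{\bar{x}}$$

that is, the WLS estimator is just the ratio of the means. Another case is where
$\sigma_i^2 = \sigma^2 x_i^2$. In this case the transformed equation is

$$\frac{y_i}{x_i} = \alpha \frac{1}{x_i} + \beta + v_i$$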

Thus the constant term in this equation is the slope coefficient in the original
equation.

⁷ White, “A Heteroskedasticity.” A similar suggestion was made earlier in C. R. Rao, “Estimation
of Heteroskedastic Variances in Linear Models,” Journal of the American Statistical
Association, March 1970, pp. 161-172.
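The two special cases can be checked numerically by carrying out the transformation explicitly. This is a small sketch under stated assumptions: `wls_by_transformation` is a hypothetical helper, and the five data points are made up for illustration.

```python
import numpy as np

def wls_by_transformation(y, regressors, z):
    """OLS of y/z on each regressor/z with no added constant (weights 1/z_i)."""
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    X = np.column_stack([np.asarray(r, dtype=float) / z for r in regressors])
    coef, *_ = np.linalg.lstsq(X, y / z, rcond=None)
    return coef

# Hypothetical data
x = np.array([2.0, 4.0, 5.0, 8.0, 10.0])
y = np.array([1.5, 4.4, 4.1, 8.3, 9.0])

# Case V(u_i) = sigma^2 * x_i with alpha = 0: z_i = sqrt(x_i), single regressor x
beta_star = wls_by_transformation(y, [x], np.sqrt(x))[0]
print(beta_star, y.mean() / x.mean())          # the two numbers agree

# Case V(u_i) = sigma^2 * x_i^2 with a constant: regress y/x on 1/x and 1
alpha_star, beta_star2 = wls_by_transformation(y, [np.ones_like(x), x], x)
print(alpha_star, beta_star2)
```

In the second case the column of ones (the constant of the transformed equation) carries the slope $\beta$ of the original equation, and the coefficient on $1/x_i$ is the original intercept $\alpha$.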

Prais and Houthakker⁸ found in their analysis of family budget data that the
errors from the equation had variance increasing with household income. They
considered a model $\sigma_i^2 = \sigma^2 [E(y_i)]^2$, that is, $\sigma_i^2 = \sigma^2 (\alpha + \beta x_i)^2$. In this case we
cannot divide the whole equation by a known constant as before. For this
model we can consider a two-step procedure as follows. First estimate $\alpha$ and $\beta$
by OLS. Let these estimators be $\hat{\alpha}$ and $\hat{\beta}$. Now use the WLS procedure as
outlined earlier, that is, regress $y_i/(\hat{\alpha} + \hat{\beta} x_i)$ on $1/(\hat{\alpha} + \hat{\beta} x_i)$ and $x_i/(\hat{\alpha} + \hat{\beta} x_i)$
with no constant term. This procedure is called a two-step weighted least
squares procedure. The standard errors we get for the estimates of $\alpha$ and $\beta$
from this procedure are valid only asymptotically. They are asymptotic standard
errors because the weights $1/(\hat{\alpha} + \hat{\beta} x_i)$ have been estimated.

One can iterate this WLS procedure further, that is, use the new estimates
of $\alpha$ and $\beta$ to construct new weights and then use the WLS procedure, and
repeat this procedure until convergence. This procedure is called the iterated
weighted least squares procedure. However, there is no gain in (asymptotic)
efficiency by iteration.
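The two-step and iterated procedures described above can be sketched as follows. This is an illustration only: the function names, the simulated data, and the fixed number of iterations (instead of a convergence check) are assumptions, and the sketch presumes that the estimated weights $\hat{\alpha} + \hat{\beta} x_i$ stay positive.

```python
import numpy as np

def fit_line(x, y, w=None):
    """Least squares fit of y = alpha + beta*x; if w is given, divide every term by w_i first."""
    w = np.ones_like(x) if w is None else w
    X = np.column_stack([1.0 / w, x / w])       # no separate constant after weighting
    coef, *_ = np.linalg.lstsq(X, y / w, rcond=None)
    return coef                                  # array [alpha, beta]

def two_step_wls(x, y, n_iter=1):
    """n_iter=1 is the two-step procedure; a larger n_iter re-weights repeatedly."""
    alpha, beta = fit_line(x, y)                 # step 1: OLS
    for _ in range(n_iter):
        w = alpha + beta * x                     # estimated E(y_i), assumed positive
        alpha, beta = fit_line(x, y, w)          # step 2: WLS with estimated weights
    return alpha, beta

# Hypothetical data with error s.d. proportional to E(y_i)
rng = np.random.default_rng(1)
x = np.linspace(1.0, 50.0, 400)
mean_y = 5.0 + 0.8 * x
y = mean_y + rng.normal(scale=0.1 * mean_y)
print(two_step_wls(x, y), two_step_wls(x, y, n_iter=5))
```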

If we make some specific assumptions about the errors, say that they are
normal, we can use the maximum likelihood method, which is more efficient
than WLS in that case.⁹ Under the assumption of normality we can
write the log-likelihood function as

$$\log L = -n \log \sigma - \sum \log(\alpha + \beta x_i) - \frac{1}{2\sigma^2} \sum \left(\frac{y_i - \alpha - \beta x_i}{\alpha + \beta x_i}\right)^2$$

Note that maximizing this likelihood function is not the same as the weighted
least squares that minimizes the expression

$$\sum \left(\frac{y_i - \alpha - \beta x_i}{\alpha + \beta x_i}\right)^2$$

A more general model is to assume that the variance $\sigma_i^2$ is equal to $(\gamma + \delta x_i)^2$.
In this case, too, we can consider a WLS procedure, that is, minimize

$$\sum \left(\frac{y_i - \alpha - \beta x_i}{\gamma + \delta x_i}\right)^2$$

or if the errors can be assumed to follow a known distribution, use the ML
method. For instance, if the errors follow a normal distribution, we can write
the log-likelihood function as

$$\log L = \text{const.} - \sum \log(\gamma + \delta x_i) - \frac{1}{2} \sum \left(\frac{y_i - \alpha - \beta x_i}{\gamma + \delta x_i}\right)^2$$

Again, note that the WLS and ML procedures are not the same.¹⁰ A two-step
WLS estimation for this model would proceed as follows. Compute the OLS
estimates of $\alpha$ and $\beta$. Get the estimated residuals and regress the absolute values
of these residuals on x to get estimates of $\gamma$ and $\delta$. Then use WLS.

⁸ S. J. Prais and H. S. Houthakker, The Analysis of Family Budgets (New York: Cambridge
University Press, 1955), p. 55ff.

⁹ Amemiya discusses the ML estimation for this model when the errors follow a normal, lognormal,
and gamma distribution. See T. Amemiya, “Regression Analysis When Variance of the
Dependent Variables Is Proportional to Square of Its Expectation,” Journal of the American
Statistical Association, December 1973, pp. 928-934.
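A minimal sketch of this two-step procedure for the model with standard deviation $\gamma + \delta x_i$ is given below. The helper names and the deterministic, made-up data are assumptions for illustration; the sketch also stops with an error if any estimated weight is not positive, a situation the text does not discuss.

```python
import numpy as np

def ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def two_step_wls_abs_resid(x, y):
    const = np.ones_like(x)
    alpha, beta = ols(np.column_stack([const, x]), y)            # step 1: OLS
    abs_resid = np.abs(y - alpha - beta * x)
    gamma, delta = ols(np.column_stack([const, x]), abs_resid)   # |u_hat| on x
    w = gamma + delta * x                                        # estimated s.d. up to scale
    if np.any(w <= 0):
        raise ValueError("estimated weights must be positive")
    alpha_w, beta_w = ols(np.column_stack([const / w, x / w]), y / w)
    return (alpha_w, beta_w), (gamma, delta)

# Hypothetical data: errors alternate in sign and grow in size with x
x = np.linspace(1.0, 20.0, 50)
signs = np.where(np.arange(50) % 2 == 0, 1.0, -1.0)
y = 2.0 + 0.9 * x + (0.5 + 0.2 * x) * signs
(alpha_w, beta_w), (gamma, delta) = two_step_wls_abs_resid(x, y)
print(alpha_w, beta_w, gamma, delta)
```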


Illustrative Example 

As an illustration, again consider the data in Table 5.1. We saw earlier that
regressing the absolute values of the residuals on x (in Glejser’s tests) gave the
following estimates:

$$\hat{\gamma} = -0.209 \qquad \hat{\delta} = 0.0512$$

Now we regress $y_i/w_i$ on $1/w_i$ and $x_i/w_i$ (with no constant term) where $w_i = \hat{\gamma} + \hat{\delta} x_i$.
The resulting equation is

$$\frac{y_i}{w_i} = \underset{(0.1643)}{0.4843}\,(1/w_i) + \underset{(0.0157)}{0.9176}\,(x_i/w_i) \qquad R^2 = 0.9886$$

If we assume that $\sigma_i^2 = \gamma_0 + \gamma_1 x_i + \gamma_2 x_i^2$, the two-step WLS procedure would
be as follows.

First, we regress $\hat{u}_i^2$ on $x_i$ and $x_i^2$. Earlier, using this regression we obtained
the estimates as

$$\hat{\gamma}_0 = 0.493 \qquad \hat{\gamma}_1 = -0.071 \qquad \hat{\gamma}_2 = 0.0037$$

Next we compute

$$w_i^2 = 0.493 - 0.071 x_i + 0.0037 x_i^2$$

and regress $y_i/w_i$ on $1/w_i$ and $x_i/w_i$. The results were

$$\frac{y_i}{w_i} = \underset{(0.3302)}{0.7296}\,(1/w_i) + \underset{(0.0199)}{0.9052}\,(x_i/w_i) \qquad R^2 = 0.9982$$

The $R^2$’s in these equations are not comparable. But our interest is in estimates
of the parameters in the consumption function

$$y = \alpha + \beta x$$

Comparing the results with the OLS estimates presented in Section 5.2, we
notice that the estimates of $\beta$ are higher than the OLS estimates, the estimates
of $\alpha$ are lower, and the standard errors are lower.


¹⁰ The ML estimation of this model is discussed in H. C. Rutemiller and D. A. Bowers, “Estimation
in a Heteroscedastic Regression Model,” Journal of the American Statistical Association,
June 1968.


5.5 Heteroskedasticity and the Use of Deflators


There are two remedies often suggested and used for solving the heteroskedasticity
problem:


1. Transforming the data to logs.

2. Deflating the variables by some measure of “size.”


The first method often does reduce the heteroskedasticity in the error variances,
although there are other criteria by which one has to decide between
the linear and the logarithmic functional forms. This problem is discussed in
greater detail in Section 5.6.

Regarding the use of deflators, one should be careful in estimating the equation
with the correct explanatory variables (as explained in the preceding section).
For instance, if the original equation involves a constant term, one should
not estimate a similar equation in the deflated variables. One should be estimating
an equation with the reciprocal of the deflator added as an extra explanatory
variable. As an illustration, consider the estimation of railroad cost
functions by Griliches¹¹ where deflation was used to solve the heteroskedasticity
problem. The variables are: C = total cost, M = miles of road, and X =
output.

If $C = aM + bX$, dividing by $M$ gives $C/M = a + b(X/M)$. But if the true
relation is $C = aM + bX + c$, deflation leads to $C/M = a + b(X/M) + c(1/M)$.
For 97 observations using 1957-1969 averages as units, the regressions were
(standard errors in parentheses)

$$\frac{C}{M} = \underset{(6218)}{13{,}016} + \underset{(0.871)}{6.431}\,\frac{X}{M} \qquad R^2 = 0.365$$

$$\frac{C}{M} = \underset{(5115)}{827} + \underset{(0.682)}{6.439}\,\frac{X}{M} + \underset{(393{,}000)}{3{,}065{,}000}\,\frac{1}{M} \qquad R^2 = 0.614$$

$$C = \underset{(2.906)}{-1.884}\,M + \underset{(0.375)}{6.613}\,X + \underset{(4730)}{3676} \qquad R^2 = 0.945$$

The coefficient c is significant in the second equation but not in the third, and
the coefficient a is not significant in either the second or third equation. From
this Griliches concludes that there is no evidence that M belongs in the equation
in any form. It appears in a significant form in the second equation only
because the other variables were divided by it.
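The point about the reciprocal of the deflator can be illustrated with noise-free data. In the sketch below the coefficient values and the M and X series are invented for illustration (they are not Griliches's data); the deflated regression recovers a, b, and c only when $1/M$ is included as a regressor.

```python
import numpy as np

# Hypothetical, noise-free data generated from C = a*M + b*X + c
a, b, c = 2.0, 6.5, 3000.0
M = np.array([120.0, 300.0, 450.0, 800.0, 1500.0, 2200.0])
X = np.array([900.0, 1500.0, 4000.0, 5200.0, 9800.0, 20000.0])
C = a * M + b * X + c

def ols(columns, y):
    coef, *_ = np.linalg.lstsq(np.column_stack(columns), y, rcond=None)
    return coef

ones = np.ones_like(M)
# Correct deflated equation: C/M = a + b*(X/M) + c*(1/M)
with_reciprocal = ols([ones, X / M, 1.0 / M], C / M)
# Deflated equation that drops the 1/M term
without_reciprocal = ols([ones, X / M], C / M)
print(with_reciprocal)       # recovers a, b, c up to rounding
print(without_reciprocal)    # a and b are distorted because c/M is omitted
```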

Some other equations estimated with the same data are the following:

$$C = \underset{(4524)}{2811} + \underset{(0.18)}{6.39}\,X \qquad R^2 = 0.944$$

$$\frac{C}{\sqrt{M}} = \underset{(3713)}{3805}\,\frac{1}{\sqrt{M}} + \underset{(0.51)}{6.06}\,\frac{X}{\sqrt{M}} \qquad R^2 = 0.826$$

The last equation is the appropriate one to estimate if $C = a + bX + u$ and
$V(u) = M\sigma^2$.

¹¹ Z. Griliches, “Railroad Cost Analysis,” The Bell Journal of Economics and Management
Science, Spring 1972.

One important thing to note is that the purpose in all these procedures of
deflation is to get more efficient estimates of the parameters. But once those
estimates have been obtained, one should make all inferences (calculation of
the residuals, prediction of future values, calculation of elasticities at the
means, etc.) from the original equation, not the equation in the deflated variables.

Another point to note is that since the purpose of deflation is to get more
efficient estimates, it is tempting to argue about the merits of the different procedures
by looking at the standard errors of the coefficients. However, this is
not correct, because in the presence of heteroskedasticity the standard errors
themselves are biased, as we showed earlier. For instance, in the five equations
presented above, the second and third are comparable and so are the fourth
and fifth. In both cases if we look at the standard errors of the coefficient of X,
the coefficient in the undeflated equation has a smaller standard error than the
corresponding coefficient in the deflated equation. However, if the standard
errors are biased, we have to be careful not to make too much of these differences.
An examination of the residuals will give a better picture.

In the preceding example we have considered miles M as a deflator and also
as an explanatory variable. In this context we should mention some discussion
in the literature on “spurious correlation” between ratios.¹² The argument simply
is that even if we have two variables X and Y that are uncorrelated, if we
deflate both the variables by another variable Z, there could be a strong correlation
between X/Z and Y/Z because of the common denominator Z. It is
wrong to infer from this correlation that there exists a close relationship between
X and Y. Of course, if our interest is in fact the relationship between
X/Z and Y/Z, there is no reason why this correlation need be called “spurious.”
As Kuh and Meyer point out, “The question of spurious correlation quite obviously
does not arise when the hypothesis to be tested has initially been formulated
in terms of ratios, for instance, in problems involving relative prices.
Similarly, when a series such as money value of output is divided by a price
index to obtain a ‘constant dollar’ estimate of output, no question of spurious
correlation need arise. Thus, spurious correlation can only exist when a hypothesis
pertains to undeflated variables and the data have been divided
through by another series for reasons extraneous to but not in conflict with the
hypothesis framed as an exact, i.e., nonstochastic relation.”
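The common-denominator effect is easy to reproduce by simulation. The sketch below draws three mutually independent positive series; the uniform distributions, the seed, and the sample size are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
# Three mutually independent, positive series (hypothetical distributions)
X = rng.uniform(1.0, 2.0, n)
Y = rng.uniform(1.0, 2.0, n)
Z = rng.uniform(0.5, 2.0, n)

corr_levels = np.corrcoef(X, Y)[0, 1]            # close to zero by construction
corr_ratios = np.corrcoef(X / Z, Y / Z)[0, 1]    # clearly positive: common denominator
print(round(corr_levels, 3), round(corr_ratios, 3))
```

The correlation between the ratios is large even though X and Y are unrelated, which is the sense in which it would be "spurious" if read as evidence about X and Y themselves.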

However, even in cases where deflation is done for reasons of estimation, we 
should note that the problem of “spurious correlation” exists only if we start 
drawing inferences on the basis of correlation coefficients when we should not 
be doing so. For example, suppose that the relationship we derive is of the 
form 


$$Y = \alpha Z + \beta X + u \tag{5.4}$$

and we find that the residuals $\hat{u}$ are heteroskedastic with variance roughly proportional
to $Z^2$. Then we should not hesitate to divide (5.4) throughout by Z
and estimate the regression equation

$$\frac{Y}{Z} = \alpha + \beta \frac{X}{Z} + u' \tag{5.5}$$

where $u' = u/Z$ has a constant variance $\sigma^2$. The estimates of $\alpha$, $\beta$, and $\sigma^2$ should
be obtained from (5.5) and not from (5.4). Whether the correlation between
Y/Z and X/Z is higher or lower than the correlation between Y and X is irrelevant.
The important point to note is that we cannot argue whether (5.4) or (5.5)
is a better equation to consider by looking at correlations. As long as we do
not base our inferences on correlations, there is nothing wrong with deflation
in this case. It should also be noted that if (5.4) is not homogeneous (i.e., it
involves a constant term), we end up with an equation of the form

$$\frac{Y}{Z} = \gamma \frac{1}{Z} + \alpha + \beta \frac{X}{Z} + u'$$

which is different from (5.5). The equation will also be different if the variance
of u is proportional not to $Z^2$ but to Z or some other function of Z.

¹² See E. Kuh and J. R. Meyer, “Correlation and Regression Estimates When the Data Are
Ratios,” Econometrica, October 1955, pp. 400-416.

In actual practice, deflation may increase or decrease the resulting correlation.
The algebra is somewhat tedious, but with some simplifying assumptions
Kuh and Meyer derive the conditions under which the correlation between
X/Z and Y/Z is in fact less than that between X and Y.

In summary, often in econometric work deflated or ratio variables are used
to solve the heteroskedasticity problem. Deflation can sometimes be justified
on pure economic grounds, as in the case of the use of “real” quantities and
relative prices. In this case all the inferences from the estimated equation will
be based on the equation in the deflated variables. However, if deflation is used
to solve the heteroskedasticity problem, any inferences we make have to be
based on the original equation, not the equation in the deflated variables. In
any case, deflation may increase or decrease the resulting correlations, but this
is beside the point. Since the correlations are not comparable anyway, one
should not draw any inferences from them.


Illustrative Example: The Density Gradient Model 

In Table 5.5 we present data on

y = population density

x = distance from the central business district

for 39 census tracts in the Baltimore area in 1970. It has been suggested (this
is called the “density gradient model”) that population density follows the relationship

$$y = A e^{-\beta x} \qquad \beta > 0$$

where A is the density of the central business district. The basic hypothesis is
that as you move away from the central business district population density
drops off.

Table 5.5 Data on Population Density in Different Census Tracts in the Baltimore
Area in 1970ᵃ

Observation   y             x             Observation   y            x
 1            18,640.0      [illegible]   21            5,485.8      6.748
 2            38,275.0      [illegible]   22            3,416.5      6.882
 3            [illegible]   [illegible]   23            8,194.7      6.948
 4            21,969.0      2.138         24            5,091.9      6.948
 5            9,573.7       2.205         25            1,183.8      7.082
 6            13,751.0      [illegible]   26            4,157.9      7.416
 7            38,947.0      3.675         27            2,158.3      7.483
 8            17,921.0      [illegible]   28            12,428.0     7.617
 9            [illegible]   4.276         29            6,788.5      7.750
10            [illegible]   [illegible]   30            3,277.4      7.750
11            6,781.1       4.543         31            3,258.2      7.951
12            8,246.2       [illegible]   32            5,491.3      8.084
13            5,166.4       4.944         33            865.02       11.250
14            7,762.4       5.211         34            340.69       13.250
15            11,081.0      5.345         35            507.03       15.500
16            [illegible]   5.679         36            323.67       18.000
17            13,753.0      5.813         37            108.36       19.000
18            7,492.4       5.813         38            805.66       23.000
19            [illegible]   5.879         39            156.84       26.250
20            [illegible]   [illegible]

ᵃ y, density of population in the census tract; x, distance of the census tract from the central
business district.

Source: I would like to thank Kajal Lahiri for providing me with these data. These data
formed the basis of the study in K. Lahiri and R. Numrich, “An Econometric Study of the
Dynamics of Urban Spatial Structure,” Journal of Urban Economics, 1983, pp. 55-79.

For estimation purposes we take logs and write

$$\log y = \log A - \beta x$$

Adding an error term u we estimate the model

$$y^* = \alpha - \beta x + u$$

where $y^* = \log y$ and $\alpha = \log A$. Estimation of this equation by OLS gave the
following results:

$$\hat{y}^* = \underset{(54.7)}{10.093} - \underset{(-12.28)}{0.2395}\,x \qquad R^2 = 0.803$$

(Figures in parentheses are the t-values, not standard errors.) The t-values are
very high and the coefficients $\alpha$ and $\beta$ are significantly different from zero (with
a significance level of less than 1%). The sign of $\beta$ is negative, as expected.
With cross-sectional data like these we expect heteroskedasticity, and this
could result in an underestimation of the standard errors (and thus an overestimation
of the t-ratios). To check whether there is heteroskedasticity, we have
to analyze the estimated residuals $\hat{u}_i$. A plot of $\hat{u}_i^2$ against $x_i$ showed a positive
relationship and hence Glejser’s tests were applied.

Defining $|\hat{u}_i|$ by $z_i$, the following equations were estimated:

$$z_i = \gamma x_i + v_i$$

$$z_i = \gamma \sqrt{x_i} + v_i$$

$$z_i = \gamma \frac{1}{x_i} + v_i$$

$$z_i = \gamma \frac{1}{\sqrt{x_i}} + v_i$$

We choose the specification that gives the highest $R^2$ [or equivalently the highest
t-value, since $R^2 = t^2/(t^2 + \text{d.f.})$ in the case of only one regressor]. The
estimated regressions with t-values in parentheses were

$$\hat{z}_i = \underset{(5.06)}{0.0445}\,x_i$$

$$\hat{z}_i = \underset{(6.42)}{0.1733}\,\sqrt{x_i}$$

$$\hat{z}_i = \underset{(4.50)}{1.390}\,\frac{1}{x_i}$$

$$\hat{z}_i = \underset{(6.42)}{1.038}\,\frac{1}{\sqrt{x_i}}$$

All the t-statistics are significant, indicating the presence of heteroskedasticity.
Based on the highest t-ratio, we chose the second specification (although the
fourth specification is equally valid). Deflating throughout by $\sqrt{x_i}$ gives the
regression equation to be estimated as

$$\frac{y_i^*}{\sqrt{x_i}} = \alpha \frac{1}{\sqrt{x_i}} + \beta \sqrt{x_i} + \text{error}$$

The estimates were

$$\hat{\alpha} = \underset{(47.87)}{9.932} \qquad \text{and} \qquad \hat{\beta} = \underset{(-15.10)}{-0.2258}$$

(Figures in parentheses are t-ratios.) The estimate of $\beta$ is negative and highly
significant. The estimated density of the central business district is given by
$\exp(\hat{\alpha}) = \exp(9.932) = 20{,}577$. Further analysis of these data is left as an exercise.
(What is the $R^2$ for the equation in the deflated variables? What can you
conclude?)
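The four Glejser regressions used in this example can be sketched as below. The function name `glejser_fits` and the eight data points are hypothetical (they are not the Table 5.5 residuals), and the $R^2$ computed here is the uncentered one, which is the version for which the identity $R^2 = t^2/(t^2 + \text{d.f.})$ holds in a regression without a constant.

```python
import numpy as np

def glejser_fits(abs_resid, x):
    """Regress |u_hat_i| on one transform of x at a time, with no constant term."""
    transforms = {
        "x": x,
        "sqrt(x)": np.sqrt(x),
        "1/x": 1.0 / x,
        "1/sqrt(x)": 1.0 / np.sqrt(x),
    }
    df = abs_resid.size - 1                      # one estimated coefficient
    results = {}
    for name, h in transforms.items():
        gamma = np.sum(h * abs_resid) / np.sum(h**2)
        rss = np.sum((abs_resid - gamma * h) ** 2)
        t_value = gamma / np.sqrt(rss / df / np.sum(h**2))
        r2 = 1.0 - rss / np.sum(abs_resid**2)    # uncentered R^2 (no constant)
        results[name] = (gamma, t_value, r2)
    return results, df

# Hypothetical absolute residuals, not the Table 5.5 values
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 12.0, 18.0, 25.0])
abs_resid = np.array([0.3, 0.2, 0.5, 0.4, 0.9, 0.7, 1.1, 1.0])
results, df = glejser_fits(abs_resid, x)
best = max(results, key=lambda name: results[name][1])   # highest t-value
print(best, results[best])
```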


*5.6 Testing the Linear Versus Log-Linear Functional Form


As we mentioned in the introduction, sometimes equations are estimated in log
form to take care of the heteroskedasticity problem. In many cases the choice
of the functional form is dictated by other considerations like convenience in
interpretation and some economic reasoning. For instance, if we are estimating
a production function, the linear form

$$X = \alpha + \beta_1 L + \beta_2 K$$

where X is the output, L the labor, and K the capital, implies perfect substitutability
among the inputs of production. On the other hand, the logarithmic form

$$\log X = \alpha + \beta_1 \log L + \beta_2 \log K$$

implies a Cobb-Douglas production function with unit elasticity of substitution.
Both these formulations are special cases of the CES (constant elasticity of
substitution) production function.

For the estimation of demand functions the log form is often preferred because
it is easy to interpret the coefficients as elasticities. For instance,

$$\log Q = \alpha + \beta_1 \log P + \beta_2 \log Y$$

where Q is the quantity demanded, P the price, and Y the income, implies that
$\beta_1$ is the price elasticity and $\beta_2$ is the income elasticity. A linear demand function
implies that these elasticities depend on the particular point along the demand
curve that we are at. In this case we have to consider some methods of
choosing statistically between the two functional forms.

When comparing the linear with the log-linear forms, we cannot compare the
$R^2$’s because $R^2$ is the ratio of explained variance to the total variance and the
variances of y and log y are different. Comparing $R^2$’s in this case is like comparing
two individuals A and B, where A eats 65% of a carrot cake and B eats
70% of a strawberry cake. The comparison does not make sense because there
are two different cakes.

The Box-Cox Test 

One solution to this problem is to consider a more general model of which both
the linear and log-linear forms are special cases. Box and Cox¹³ consider the
transformation

$$y(\lambda) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \text{for } \lambda \neq 0 \\[2ex] \log y & \text{for } \lambda = 0 \end{cases} \tag{5.6}$$

This transformation is well defined for all $y > 0$. Also, the transformation is
continuous since

$$\lim_{\lambda \to 0} \frac{y^\lambda - 1}{\lambda} = \log y$$

¹³ G. E. P. Box and D. R. Cox, “An Analysis of Transformations” (with discussion), Journal of
the Royal Statistical Society, Series B, 1964, pp. 211-252.
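A direct implementation of the transformation (5.6) makes the continuity at $\lambda = 0$ easy to check numerically. The function name `box_cox` and the sample values are illustrative assumptions.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of positive y: (y**lam - 1)/lam, or log(y) when lam == 0."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("y must be positive")
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

y = np.array([0.5, 1.0, 2.0, 10.0])
print(box_cox(y, 1.0))     # y - 1: linear in y
print(box_cox(y, 1e-6))    # numerically close to log(y)
print(box_cox(y, 0.0))     # log(y)
```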

Box and Cox consider the regression model

$$y_i(\lambda) = \beta x_i + u_i \tag{5.7}$$

where $u_i \sim \mathrm{IN}(0, \sigma^2)$. For the sake of simplicity of exposition we are considering
only one explanatory variable. Also, instead of considering $x_i$ we can
consider $x_i(\lambda)$. For $\lambda = 0$ this is a log-linear model, and for $\lambda = 1$ this is a linear
model.

There are two main problems with the specification in equation (5.7):

1. The assumption that the errors $u_i$ in (5.7) are $\mathrm{IN}(0, \sigma^2)$ for all values of $\lambda$
is not a reasonable assumption.

2. Since $y > 0$, unless $\lambda = 0$ the definition of $y(\lambda)$ in (5.6) imposes some
constraints on $y(\lambda)$ that depend on the unknown $\lambda$. Since $y > 0$, we have,
from equation (5.6),

$$y(\lambda) > -\frac{1}{\lambda} \text{ if } \lambda > 0 \qquad \text{and} \qquad y(\lambda) < -\frac{1}{\lambda} \text{ if } \lambda < 0$$

However, we will ignore these problems and describe the Box-Cox method.

Based on the specification given by (5.7) Box and Cox suggest estimating
$\lambda$ by the maximum likelihood (ML) method. We can then test the hypotheses:
$\lambda = 0$ and $\lambda = 1$. If the hypothesis $\lambda = 0$ is accepted, we use log y as the
explained variable. If the hypothesis $\lambda = 1$ is accepted, we use y as the explained
variable. A problem arises only if both hypotheses are rejected or both
accepted. In this case we have to use the estimated $\lambda$, and work with $y(\lambda)$.

The ML method suggested by Box and Cox amounts to the following procedure:¹⁴

1. Divide each y by the geometric mean of the y’s.

2. Now compute $y(\lambda)$ for different values of $\lambda$ and regress it on x. Compute
the residual sum of squares and denote it by $\hat{\sigma}^2(\lambda)$.

3. Choose the value of $\lambda$ for which $\hat{\sigma}^2(\lambda)$ is minimum. This value of $\lambda$ is the
ML estimator of $\lambda$.

As a special case, consider the problem of choosing between the linear and
log-linear models:

$$y = \alpha + \beta x + u$$

and

$$\log y = \alpha' + \beta' x + u$$

What we do is first divide each $y_i$ by the geometric mean of the y’s. Then we
estimate the two regressions and choose the one with the smaller residual sum
of squares. This is the Box-Cox procedure.

¹⁴ G. S. Maddala, Econometrics (New York: McGraw-Hill, 1977), pp. 316-317.
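The three-step procedure can be sketched as a grid search over $\lambda$. This is an illustration under stated assumptions: the regression includes a constant term (as in the special case above), the grid of $\lambda$ values is arbitrary, and the data are constructed to be exactly log-linear so that the search should select $\lambda = 0$.

```python
import numpy as np

def box_cox(y, lam):
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def rss_after_scaling(x, y, lam):
    """RSS from regressing the transformed, geometric-mean-scaled y on a constant and x."""
    y_scaled = y / np.exp(np.mean(np.log(y)))         # step 1: divide by geometric mean
    X = np.column_stack([np.ones_like(x), x])
    target = box_cox(y_scaled, lam)                   # step 2: transform, then regress
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.sum((target - X @ coef) ** 2)

def best_lambda(x, y, grid):
    rss = np.array([rss_after_scaling(x, y, lam) for lam in grid])
    return grid[int(np.argmin(rss))], rss            # step 3: smallest RSS

# Hypothetical data that are exactly log-linear in x
x = np.linspace(1.0, 10.0, 30)
y = np.exp(0.5 + 0.3 * x)
grid = np.round(np.arange(-1.0, 1.01, 0.25), 2)       # includes 0 (log) and 1 (linear)
lam_hat, rss = best_lambda(x, y, grid)
print(lam_hat)
```

Comparing only `rss_after_scaling(x, y, 0.0)` with `rss_after_scaling(x, y, 1.0)` gives the linear versus log-linear comparison described in the special case.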

We will now describe two other tests that are based on artificial regressions.

The BM Test 

This is the test suggested by Bera and McAleer.¹⁵ Suppose the log-linear and
linear models to be tested are given by

$$H_0\colon \log y_t = \beta_0 + \beta_1 x_t + u_{0t} \qquad u_{0t} \sim \mathrm{IN}(0, \sigma_0^2)$$

and

$$H_1\colon y_t = \beta_0 + \beta_1 x_t + u_{1t} \qquad u_{1t} \sim \mathrm{IN}(0, \sigma_1^2)$$

The BM test involves three steps.

Step 1. Obtain the predicted values $\widehat{\log y_t}$ and $\hat{y}_t$ from the two equations,
respectively. The predicted value of $y_t$ from the log-linear equation is
$\exp(\widehat{\log y_t})$. The predicted value of $\log y_t$ from the linear equation is $\log \hat{y}_t$.

Step 2. Compute the artificial regressions:

$$\exp(\widehat{\log y_t}) = \beta_0 + \beta_1 x_t + v_{1t}$$

and

$$\log \hat{y}_t = \beta_0 + \beta_1 x_t + v_{0t}$$

Let the estimated residuals from these two regression equations be $\hat{v}_{1t}$
and $\hat{v}_{0t}$, respectively.

Step 3. The tests for $H_0$ and $H_1$ are based on $\theta_0$ and $\theta_1$ in the artificial regressions:

$$\log y_t = \beta_0 + \beta_1 x_t + \theta_0 \hat{v}_{1t} + \varepsilon_t$$

and

$$y_t = \beta_0 + \beta_1 x_t + \theta_1 \hat{v}_{0t} + \varepsilon_t$$

We use the usual t-tests to test these hypotheses. If $\theta_0 = 0$ is accepted,
we choose the log-linear model. If $\theta_1 = 0$ is accepted, we choose the linear
model. A problem arises if both these hypotheses are rejected or both are
accepted.
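The three steps of the BM test can be sketched with ordinary least squares routines. The helper names and the simulated data (generated from a log-linear model with small errors) are assumptions for illustration; the sketch also requires the linear-model predictions to be positive so that their logarithm exists.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, t-values, and fitted values from a least squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    t_values = coef / np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, t_values, fitted

def bm_test(x, y):
    X = np.column_stack([np.ones_like(x), x])
    # Step 1: predicted values from the log-linear (H0) and linear (H1) models
    log_fit = ols(X, np.log(y))[2]
    lin_fit = ols(X, y)[2]
    if np.any(lin_fit <= 0):
        raise ValueError("linear-model predictions must be positive to take logs")
    # Step 2: residuals from the two artificial regressions
    exp_log_fit = np.exp(log_fit)
    v1 = exp_log_fit - ols(X, exp_log_fit)[2]
    log_lin_fit = np.log(lin_fit)
    v0 = log_lin_fit - ols(X, log_lin_fit)[2]
    # Step 3: t-values of theta_0 (log equation) and theta_1 (linear equation)
    t_theta0 = ols(np.column_stack([X, v1]), np.log(y))[1][-1]
    t_theta1 = ols(np.column_stack([X, v0]), y)[1][-1]
    return t_theta0, t_theta1

# Hypothetical data generated from a log-linear model
rng = np.random.default_rng(7)
x = np.linspace(1.0, 10.0, 60)
y = np.exp(1.0 + 0.1 * x + rng.normal(scale=0.02, size=x.size))
t_theta0, t_theta1 = bm_test(x, y)
print(t_theta0, t_theta1)
```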

¹⁵ A. K. Bera and M. McAleer, “Further Results on Testing Linear and Log-Linear Regression
Models,” paper presented at the SSRC Econometric Group Conference on Model Specification
and Testing, Warwick, England, 1982.

The PE Test

The PE test¹⁶ also uses artificial regressions. It involves only two steps. Step 1
is the same as in the BM test.

Step 2. Test $\theta_0 = 0$ and $\theta_1 = 0$ in the artificial regressions:

$$\log y_t = \beta_0 + \beta_1 x_t + \theta_0 \left[\hat{y}_t - \exp(\widehat{\log y_t})\right] + \varepsilon_t$$

and

$$y_t = \beta_0 + \beta_1 x_t + \theta_1 \left[\widehat{\log y_t} - \log \hat{y}_t\right] + \varepsilon_t$$

There are many other tests for this problem of choosing linear versus log-linear
forms.¹⁷ The three tests mentioned here are the easiest to compute. We are not
presenting an illustrative example here. The computation of the Box-Cox, BM,
and PE tests for the data in Tables 4.7 and 5.5 is left as an exercise.
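A corresponding sketch of the PE test follows. As before, the helper names and the simulated log-linear data are assumptions for illustration, and the linear-model predictions must be positive for their logarithm to exist.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, t-values, and fitted values from a least squares fit."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    t_values = coef / np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, t_values, fitted

def pe_test(x, y):
    X = np.column_stack([np.ones_like(x), x])
    # Step 1: predicted values from the log-linear and the linear model
    log_fit = ols(X, np.log(y))[2]
    lin_fit = ols(X, y)[2]
    if np.any(lin_fit <= 0):
        raise ValueError("linear-model predictions must be positive to take logs")
    # Step 2: add the difference in predictions to each model and test its coefficient
    extra_log = lin_fit - np.exp(log_fit)
    extra_lin = log_fit - np.log(lin_fit)
    t_theta0 = ols(np.column_stack([X, extra_log]), np.log(y))[1][-1]
    t_theta1 = ols(np.column_stack([X, extra_lin]), y)[1][-1]
    return t_theta0, t_theta1

# Hypothetical data generated from a log-linear model
rng = np.random.default_rng(7)
x = np.linspace(1.0, 10.0, 60)
y = np.exp(1.0 + 0.1 * x + rng.normal(scale=0.02, size=x.size))
t_theta0, t_theta1 = pe_test(x, y)
print(t_theta0, t_theta1)
```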


Summary 


1. If the error variance is not constant for all the observations, this is known 
as the heteroskedasticity problem. The problem is informally illustrated with 
an example in Section 5.1. 

2. First, we would like to know whether the problem exists. For this purpose 
some tests have been suggested. We have discussed the following tests: 

(a) Ramsey’s test. 

(b) Glejser’s tests. 

(c) Breusch and Pagan’s test. 

(d) White’s test. 

(e) Goldfeld and Quandt’s test. 

(f) Likelihood ratio test. 

Some of these tests have been illustrated with examples (see Section 5.2). Others
have been left as exercises. There are two data sets (Tables 4.7 and 5.5)
that have been provided for use by students who can experiment with these
tests.

3. The consequences of the heteroskedasticity problem are 

(a) The least squares estimators are unbiased but inefficient. 

(b) The estimated variances are themselves biased. 

If the heteroskedasticity problem is detected, we can try to solve it by the use
of weighted least squares. Otherwise, we can at least try to correct the error
variances (since the estimators are unbiased). This correction (due to White) is
illustrated at the end of Section 5.3.

¹⁶ J. G. MacKinnon, H. White, and R. Davidson, “Tests for Model Specification in the Presence
of Alternative Hypotheses: Some Further Results,” Journal of Econometrics, Vol. 21, 1983,
pp. 53-70.

¹⁷ For instance, L. G. Godfrey and M. R. Wickens, “Testing Linear and Log-Linear Regressions
for Functional Form,” Review of Economic Studies, 1981, pp. 487-496, and R. Davidson and
J. G. MacKinnon, “Testing Linear and Log-Linear Regressions Against Box-Cox Alternatives,”
Canadian Journal of Economics, 1985, pp. 499-517.

4. There are three solutions commonly suggested for the heteroskedasticity 
problem: 

(a) Use of weighted least squares. 

(b) Deflating the data by some measure of “size.” 

(c) Transforming the data to the logarithmic form. 

In weighted least squares, the particular weighting scheme used will depend on 
the nature of heteroskedasticity. Weighted least squares methods are illustrated 
in Section 5.4. 

5. The use of deflators is similar to the weighted least squares method, although
it is done in a more ad hoc fashion. Some problems with the use of
deflators are discussed in Section 5.5.

6. The question of estimation in linear versus logarithmic form has received
considerable attention during recent years. Several statistical tests have been
suggested for testing the linear versus logarithmic form. In Section 5.6 we discuss
three of these tests: the Box-Cox test, the BM test, and the PE test. All
are easy to implement with standard regression packages. We have not illustrated
the use of these tests. This is left as an exercise. It would be interesting
to see which functional form is chosen and whether the heteroskedasticity
problem exists for the functional form chosen.

7. Note that the tests discussed in Section 5.6 start by assuming homoskedastic
errors for both functional forms.


Exercises 


1. Define the terms “heteroskedasticity” and “homoskedasticity.” Explain the 
effects of heteroskedasticity on the estimates of the parameters and their 
variances in a normal regression model. 

2. Explain the following tests for homoskedasticity. 

(a) Ramsey’s test. 

(b) Goldfeld and Quandt’s test. 

(c) Glejser’s test. 

(d) Breusch and Pagan’s test. 

Illustrate each of these tests with the data in Tables 4.7 and 5.5. 

3. Indicate whether each of the following statements is true (T), false (F), or 
uncertain (U), and give a brief explanation. 

(a) Heteroskedasticity in the errors leads to biased estimates of the 
regression coefficients and their standard errors. 

(b) Deflating income and consumption by the same price results in a
higher estimate for the marginal propensity to consume.

(c) The correlation between two ratios which have the same denominator
is always biased upward.

4. Apply the following tests to choose between the linear and log-linear regression
models with the data in Tables 4.7 and 5.5.

(a) Box-Cox test. 

(b) BM test. 

(c) PE test. 

5. In the model

$$y_{1t} = a_{11} x_{1t} + a_{12} x_{2t} + u_{1t}$$

$$y_{2t} = a_{21} x_{1t} + a_{22} x_{2t} + u_{2t}$$

you are told that

$$a_{11} + a_{12} = a_{21}$$

$$a_{11} - a_{12} = a_{22}$$

$u_{1t} \sim \mathrm{IN}(0, \sigma^2)$, $u_{2t} \sim \mathrm{IN}(0, 4\sigma^2)$, and $u_{1t}$ and $u_{2t}$ are independent. Explain
how you will estimate the parameters $a_{11}$, $a_{12}$, $a_{21}$, $a_{22}$, and $\sigma^2$.

6. Explain how you will choose among the following four regression models.

$$y = \alpha_1 + \beta_1 x + u_1$$

$$y = \alpha_2 + \beta_2 \log x + u_2$$

$$\log y = \alpha_3 + \beta_3 x + u_3$$

$$\log y = \alpha_4 + \beta_4 \log x + u_4$$

7. In the linear regression model

$$y_i = \alpha + \beta x_i + u_i$$

the errors $u_i$ are presumed to have a variance depending on a variable $z_i$.
Explain how you will choose among the following four specifications.

$$\mathrm{var}(u_i) = \sigma^2$$

$$\mathrm{var}(u_i) = \sigma^2 z_i$$

$$\mathrm{var}(u_i) = \sigma^2 z_i^2$$

$$\mathrm{var}(u_i) = \sigma^2 z_i^{[\text{exponent illegible}]}$$

8. In a study of 27 industrial establishments of varying size, y = the number 
of supervisors and x = the number of supervised workers, y varies from 30 
to 210 and x from 247 to 1650. The results obtained were as follows: 


Variable     Coefficient     SE        t
x            0.115           0.011     9.30
Constant     14.448          9.562     1.51

n = 27       s = 21.73       R^2 = 0.776



After the estimation of the equation and plotting the residuals against x, it 
was found that the variance of the residuals increased with x. Plotting the 
residuals against $1/x$ showed that there was no such relationship. Hence the
assumption made was

$$\mathrm{var}(u_i) = \sigma^2 x_i^2$$

The estimated equation was

$$\frac{y}{x} = 0.121 + 3.803\,(1/x) \qquad R^2 = 0.03$$
                (0.009)  (4.570)

In terms of the original variables, we have

$$y = 3.803 + 0.121x$$

The estimation gave the following results: 


Variable     Coefficient     SE        t
x            0.121           0.009     13.44
Constant     3.803           4.570     0.832

n = 27       s = 22.577      R^2 = 0.7587


(a) An investigator looks at the drop in R 2 and concludes that the first 
equation is better. Is this conclusion valid? 

(b) What would the equation to be estimated be if $\mathrm{var}(u_i) = \sigma^2 x_i$ instead
of $\sigma^2 x_i^2$? How would you determine which of these alternative
hypotheses is the better one?

(c) Comment on the computation of R 2 from the transformed equation 
and the R 2 from the equation in terms of the original variables. 

9. In discussion of real estate assessment, it is often argued that the higher- 
priced houses get assessed at a lower proportion of value than the lower- 
priced houses. To determine whether such inequity exists, the following 
equations are estimated: 

1. $A_i = \alpha + \beta S_i + u_i$

2. $A_i/S_i = \gamma + \delta S_i + u_i'$

3. $\log A_i = \epsilon + \lambda \log S_i + u_i$

where $A_i$ is the assessment for property $i$ and $S_i$ is the observed selling price.
In a sample of 416 houses in King County, Washington during the period
1977-1979, the following results were obtained:

1. $A_i = 7505.40 + 0.3382\,S_i \qquad R^2 = 0.597$
          (559.2)    (0.0136)      standard errors
          (13.42)    (24.79)       t-ratios

2. $A_i/S_i = 0.7374 - 4.5714 \times 10^{-6}\,S_i \qquad R^2 = 0.2917$
              (0.0144)  (3.5005 x 10^-7)   standard errors
              (51.38)   (-13.06)           t-ratios

3. $\log A_i = 2.8312 + 0.6722 \log S_i \qquad R^2 = 0.6547$
               (0.2513)  (0.0240)

It has been suggested that it is more appropriate to answer this question by
estimating a reverse regression:

4. $S_i = \gamma_0 + \gamma_1 A_i + \eta_i$

5. $A_i/S_i = \beta_0 + \beta_1 A_i + \eta_i'$

The estimation now gave the results:

4. $S_i = 2050.07 + 1.7669\,A_i \qquad R^2 = 0.597$
          (1527.93)  (0.0713)      standard errors
          (1.3417)   (24.79)       t-ratios

5. $A_i/S_i = 0.5556 + 3.8288 \times 10^{-7}\,A_i \qquad R^2 = 0.0004$
              (0.0203)  (9.506 x 10^-7)    standard errors
              (27.26)   (0.403)            t-ratios

The average value of A/S was 5.6439. 

(a) Interpret what the coefficients in each of these equations mean to
answer the question of whether there is inequity in real estate
assessment.

(b) Give some arguments as to why equations (4) and (5) may be more
appropriate to look at than equations (1) to (3).

(c) Also, explain why equations (2) and (5) may be more appropriate than 
equations (1) and (4) respectively. 

(The results are from L. A. Kochin and R. W. Parks, “Testing for 
Assessment Uniformity: A Reappraisal,” Property Tax Journal, 
March 1984, pp. 27-54.) 


Appendix to Chapter 5 

Generalized Least Squares 

In Chapter 4 we assumed that the errors were independent with a common
variance $\sigma^2$, or $E(uu') = I\sigma^2$. This assumption is relaxed in Chapters 5 and 6.
We start with the model

$$y = X\beta + u \qquad E(uu') = \Omega$$

where $\Omega$ is an arbitrary positive definite matrix. Premultiplying by $\Omega^{-1/2}$, we
get

$$y^* = X^*\beta + u^*$$

where

$$y^* = \Omega^{-1/2} y \qquad X^* = \Omega^{-1/2} X \qquad u^* = \Omega^{-1/2} u$$

Then $E(u^* u^{*\prime}) = \Omega^{-1/2} E(uu')\, \Omega^{-1/2} = I$. Hence by the result in the Appendix
to Chapter 4, we get the BLUE of $\beta$ as

$$\hat\beta_{GLS} = (X^{*\prime} X^*)^{-1} (X^{*\prime} y^*) = (X'\Omega^{-1}X)^{-1} (X'\Omega^{-1}y)$$
We use the subscript GLS to denote "generalized least squares." The variance
of this estimator is given by

$$V(\hat\beta_{GLS}) = (X^{*\prime} X^*)^{-1} = (X'\Omega^{-1}X)^{-1}$$

By contrast, the ordinary least squares estimator is given by

$$\hat\beta_{OLS} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'u$$

Since $E(u) = 0$ this estimator is still unbiased. But its variance now is given by

$$V(\hat\beta_{OLS}) = (X'X)^{-1}X'E(uu')X(X'X)^{-1} = (X'X)^{-1}X'\Omega X(X'X)^{-1}$$

This is the general expression corresponding to equation (5.2) in Section 5.3.
In this chapter we considered the case where $\Omega$ is a diagonal matrix with the
$i$th diagonal element $\sigma_i^2$. It is important to note that $\hat\beta_{OLS}$ is not necessarily
inefficient if $\Omega \neq I\sigma^2$. Rao 18 showed that a necessary and sufficient condition that
the OLS and GLS methods are equivalent is that $\Omega$ be of the form

$$\Omega = XCX' + ZDZ' + I\sigma^2$$


where Z is a matrix such that X'Z = 0 and C and D are arbitrary nonnegative 
definite matrices. As an example consider the following model: 


$$y_i = a + bx_i + u_i \qquad i = 1, 2, \ldots, n$$
$$\mathrm{var}(u_i) = 1 \qquad \mathrm{cov}(u_i, u_j) = \rho \quad \text{for } i \neq j$$

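Thus the errors $u_i$ are not independent. They are equicorrelated. In matrix form
we can write this model as

$$y = X\beta + u$$

where

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}
\qquad
E(uu') = \begin{bmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & & \vdots \\ \rho & \rho & \cdots & 1 \end{bmatrix} = (1 - \rho)I + \rho ee'$$

where $e$ is an $n \times 1$ vector with all 1's (the first column of $X$). Thus

$$\Omega = XCX' + I\sigma^2$$

with $\sigma^2 = (1 - \rho)$ and $C = \begin{bmatrix} \rho & 0 \\ 0 & 0 \end{bmatrix}$, so that $XCX' = \rho ee'$. Hence, in this
model, even if the errors are correlated and $\Omega \neq I\sigma^2$, the OLS and GLS
estimators are identical.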
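
The equivalence of OLS and GLS in the equicorrelated model can be checked numerically. The following is only a sketch under assumed values: the sample size, the value of rho, and the simulated x and y are illustrative choices, not data from the text.

```python
import numpy as np

# Illustrative (assumed) design: n observations, intercept plus one regressor.
rng = np.random.default_rng(0)
n, rho = 8, 0.4
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.normal(size=n)  # any y will do; the identity is algebraic

# Equicorrelated error covariance: Omega = (1 - rho) I + rho e e'
Omega = (1 - rho) * np.eye(n) + rho * np.ones((n, n))
Omega_inv = np.linalg.inv(Omega)

# OLS: (X'X)^{-1} X'y ;  GLS: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)

# Rao's form with C = diag(rho, 0), D = 0 and sigma^2 = 1 - rho
C = np.array([[rho, 0.0], [0.0, 0.0]])
decomposition_holds = np.allclose(Omega, X @ C @ X.T + (1 - rho) * np.eye(n))

print(beta_ols, beta_gls, decomposition_holds)
```

The two coefficient vectors agree up to rounding because the vector of ones lies in the column space of X; dropping the intercept from X would generally break the equality.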




18 C. R. Rao, "Least Squares Theory Using an Estimated Dispersion Matrix," Proceedings of
the Fifth Berkeley Symposium (Berkeley, Calif.: University of California Press, 1967), Vol. 1,
pp. 355-372.
6.2 Durbin-Watson Test

6.3 Estimation in Levels Versus First Differences 

6.4 Estimation Procedures with Autocorrelated Errors 

6.5 Effect of AR(1) errors on OLS Estimates 

6.6 Some Further Comments on the DW Test 

6.7 Tests for Serial Correlation in Models with Lagged 
Dependent Variables 

6.8 A General Test for Higher-Order Serial Correlation: 
The LM Test 

6.9 Strategies When the DW Test Statistic Is Significant 
*6.10 Trends and Random Walks 

*6.11 ARCH Models and Serial Correlation 
Summary 
Exercises 


6.1 Introduction 


In Chapter 5 we considered the consequences of relaxing the assumption that 
the variance of the error term is constant. We now come to the next assumption 
that the error terms in the regression model are independent. In this chapter 
we study the consequences of relaxing this assumption. 
There are two situations under which error terms in the regression model can
be correlated. In cross-section data it can arise among contiguous units. For
instance, if we are studying consumption patterns of households, the error
terms for households in the same neighborhood can be correlated. This is
because the error term picks up the effect of omitted variables and these variables
tend to be correlated for households in the same neighborhood (because of the
"keeping up with the Joneses" effect). Similarly, if our data are on states, the
error terms for contiguous states tend to be correlated. All these examples fall
in the category of spatial correlation. In this chapter we will not be concerned
with this type of correlation. Some of the factors that produce this type of
correlation among the error terms can be taken care of by the use of dummy
variables discussed in Chapter 8.

What we will be discussing in this chapter is the correlation between the
error terms arising in time series data. This type of correlation is called
autocorrelation or serial correlation. The error term $u_t$ at time period $t$ is correlated
with error terms $u_{t+1}, u_{t+2}, \ldots$ and $u_{t-1}, u_{t-2}, \ldots$ and so on. Such correlation
in the error terms often arises from the correlation of the omitted variables that
the error term captures.

The correlation between $u_t$ and $u_{t-k}$ is called an autocorrelation of order $k$.
The correlation between $u_t$ and $u_{t-1}$ is the first-order autocorrelation and is
usually denoted by $\rho_1$. The correlation between $u_t$ and $u_{t-2}$ is called the
second-order autocorrelation and is denoted by $\rho_2$, and so on. There are $(n - 1)$ such
autocorrelations if we have $n$ observations. However, we cannot hope to
estimate all of these from our data. Hence we often assume that these $(n - 1)$
autocorrelations can be represented in terms of one or two parameters.

In the following sections we discuss how to 

1. Test for the presence of serial correlation. 

2. Estimate the regression equation when the errors are serially correlated. 

6.2 Durbin—Watson Test 


The simplest and most commonly used model is one where the errors $u_t$ and
$u_{t-1}$ have a correlation $\rho$. For this model one can think of testing hypotheses
about $\rho$ on the basis of $\hat\rho$, the correlation between the least squares residuals $\hat u_t$
and $\hat u_{t-1}$. A commonly used statistic for this purpose (which is related to $\hat\rho$) is
the Durbin-Watson (DW) statistic, which we will denote by $d$. It is defined as

$$d = \frac{\sum_{t=2}^{n} (\hat u_t - \hat u_{t-1})^2}{\sum_{t=1}^{n} \hat u_t^2}$$

where $\hat u_t$ is the estimated residual for period $t$. We can write $d$ as

$$d = \frac{\sum \hat u_t^2 + \sum \hat u_{t-1}^2 - 2 \sum \hat u_t \hat u_{t-1}}{\sum \hat u_t^2}$$

Since $\sum \hat u_t^2$ and $\sum \hat u_{t-1}^2$ are approximately equal if the sample is large, we have
$d \approx 2(1 - \hat\rho)$. If $\hat\rho = +1$, then $d = 0$, and if $\hat\rho = -1$, then $d = 4$. We have
$d = 2$ if $\hat\rho = 0$. If $d$ is close to 0 or 4, the residuals are highly correlated.
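
The definition of d and its large-sample relation to the residual autocorrelation can be illustrated with a few lines of code. This is a minimal sketch: the function names are our own, and the "residuals" are simulated AR(1) values with an assumed rho of 0.6 rather than residuals from an actual regression.

```python
import numpy as np

def durbin_watson(resid):
    """DW statistic: sum of squared first differences over sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

def rho_hat(resid):
    """First-order autocorrelation estimate: sum(u_t u_{t-1}) / sum(u_t^2)."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(resid[1:] * resid[:-1]) / np.sum(resid ** 2))

# Simulated stand-in for residuals (assumption: AR(1) with rho = 0.6).
rng = np.random.default_rng(1)
e = rng.normal(size=2000)
u = np.empty_like(e)
u[0] = e[0]
for t in range(1, len(e)):
    u[t] = 0.6 * u[t - 1] + e[t]

d = durbin_watson(u)
approx = 2 * (1 - rho_hat(u))
print(d, approx)
```

The exact difference between d and 2(1 - rho_hat) is the ratio of the first and last squared residuals to the total sum of squares, which is why the approximation improves as the sample grows.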

The sampling distribution of $d$ depends on the values of the explanatory
variables and hence Durbin and Watson 1 derived upper ($d_U$) limits and lower ($d_L$)
limits for the significance levels for $d$. There are tables to test the hypothesis
of zero autocorrelation against the hypothesis of first-order positive
autocorrelation. (For negative autocorrelation we interchange $d_L$ and $d_U$.)

If $d < d_L$, we reject the null hypothesis of no autocorrelation.

If $d > d_U$, we do not reject the null hypothesis.

If $d_L < d < d_U$, the test is inconclusive.

Hannan and Terrell 2 show that the upper bound of the DW statistic is a good
approximation to its distribution when the regressors are slowly changing.
They argue that economic time series are slowly changing and hence one can
use $d_U$ as the correct significance point.

The significance points in the DW tables at the end of the book are tabulated
for testing $\rho = 0$ against $\rho > 0$. If $d > 2$ and we wish to test the hypothesis
$\rho = 0$ against $\rho < 0$, we consider $4 - d$ and refer to the Durbin-Watson tables
as if we are testing for positive autocorrelation.

Although we have said that $d \approx 2(1 - \hat\rho)$, this approximation is valid only in
large samples. The mean of $d$ when $\rho = 0$ has been shown to be given
approximately by (the proof is rather complicated for our purpose)

$$E(d) \approx 2 + \frac{2(k - 1)}{n - k}$$

where $k$ is the number of regression parameters estimated (including the
constant term) and $n$ is the sample size. Thus, even for zero serial correlation, the
statistic is biased upward from 2. If $k = 5$ and $n = 15$, the bias is as large as
0.8. We illustrate the use of the DW test with an example.


Illustrative Example 

Consider the data in Table 3.11. The estimated production function is

$$\log X = -3.938 + 1.451 \log L_1 + 0.384 \log K_1$$
            (0.237)   (0.083)         (0.048)

$$R^2 = 0.9946 \qquad DW = 0.858 \qquad \hat\rho = 0.559$$

Referring to the DW tables with $k' = 2$ and $n = 39$ for the 5% significance
level, we see that $d_L = 1.38$. Since the observed $d = 0.858 < d_L$, we reject the
hypothesis $\rho = 0$ at the 5% level.


1 J. Durbin and G. S. Watson, "Testing for Serial Correlation in Least Squares Regression,"
Biometrika, 1950, pp. 409-428; 1951, pp. 159-178.

2 E. J. Hannan and R. D. Terrell, "Testing for Serial Correlation After Least Squares
Regression," Econometrica, 1966, pp. 646-660.
Although the DW test is the most commonly used test for serial correlation,
it has several limitations.

1. It tests for only first-order serial correlation.

2. The test is inconclusive if the computed value lies between $d_L$ and $d_U$.

3. The test cannot be applied in models with lagged dependent variables.

At this point it would be distracting to answer all these criticisms. We will
present answers to these points in later sections of this chapter. First we
discuss some simple solutions to the serial correlation problem.

6.3 Estimation in Levels Versus 
First Differences 

If the DW test rejects the hypothesis of zero serial correlation, what is the next 
step? 

In such cases one estimates a regression by transforming all the variables by
$\rho$-differencing, that is, regress $y_t - \hat\rho y_{t-1}$ on $x_t - \hat\rho x_{t-1}$, where $\hat\rho$ is the
estimated $\rho$. However, since $\hat\rho$ is subject to sampling errors, one other alternative
that is followed if the DW statistic $d$ is small is to use a first-difference
equation. In fact, a rough rule of thumb is: Estimate an equation in first differences
whenever the DW statistic is $< R^2$. In first difference equations, we regress
$(y_t - y_{t-1})$ on $(x_t - x_{t-1})$ (with all the explanatory variables differenced
similarly). The implicit assumption is that the first differences of the errors
$(u_t - u_{t-1})$ are independent. For instance, if

$$y_t = \alpha + \beta x_t + u_t$$

is the regression equation, then

$$y_{t-1} = \alpha + \beta x_{t-1} + u_{t-1}$$

and we have by subtraction

$$(y_t - y_{t-1}) = \beta (x_t - x_{t-1}) + (u_t - u_{t-1})$$

If the errors in this equation are independent, we can estimate the equation by
OLS. However, since the constant term $\alpha$ disappears under subtraction, we
should be estimating the regression equation with no constant term. Often, we
find a constant term also included in regression equations with first differences.
This procedure is valid only if there is a linear trend term in the original
equation. If the regression equation is

$$y_t = \alpha + \delta t + \beta x_t + u_t$$

then

$$y_{t-1} = \alpha + \delta (t - 1) + \beta x_{t-1} + u_{t-1}$$

and on subtraction we get

$$(y_t - y_{t-1}) = \delta + \beta (x_t - x_{t-1}) + (u_t - u_{t-1})$$

which is an equation with a constant term.

When comparing equations in levels and first differences, one cannot
compare the $R^2$'s because the explained variables are different. One can compare
the residual sum of squares but only after making a rough adjustment. Note
that if $\mathrm{var}(u_t) = \sigma^2$, then the variance of the error term in the first difference
equation is

$$\mathrm{var}(u_t - u_{t-1}) = \mathrm{var}(u_t) + \mathrm{var}(u_{t-1}) - 2\,\mathrm{cov}(u_t, u_{t-1})$$
$$= \sigma^2 + \sigma^2 - 2\sigma^2 \rho$$
$$= 2\sigma^2 (1 - \rho)$$

where $\rho$ is the correlation coefficient between $u_t$ and $u_{t-1}$. Since the residual
sum of squares divided by the appropriate degrees of freedom gives a
consistent estimator for the error variance, the two residual sums of squares can be
made roughly comparable if we multiply the residual sum of squares from the
levels equation by

$$\frac{n - k - 1}{n - k}\, 2(1 - \rho)$$

where $k$ is the number of regressors. If $\hat\rho$ is an estimate of $\rho$ from the levels
equation, since $\hat\rho \approx (2 - d)/2$ where $d$ is the DW test statistic, we get
$2(1 - \hat\rho) \approx d$. Thus, we can multiply the residual sum of squares from the levels
equation by

$$\frac{n - k - 1}{n - k}\, d$$

or if $n$ is large, just by $d$. For instance, if the residual sum of squares is, say,
1.2 by the levels equation, and 0.8 by the first difference equation and $n = 11$,
$k = 1$, DW = 0.9, then the adjusted residual sum of squares with the levels
equation is $(9/10)(0.9)(1.2) = 0.97$, which is the number to be compared with
0.8.
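
The adjustment is simple enough to wrap in a small helper. The sketch below reproduces the numerical example just given; the function name and argument names are our own.

```python
def adjusted_levels_rss(rss_levels, d, n, k):
    """Scale the levels RSS by ((n - k - 1) / (n - k)) * d so that it is
    roughly comparable with the RSS from the first difference equation."""
    return (n - k - 1) / (n - k) * d * rss_levels

# Numbers from the example in the text: RSS = 1.2, DW = 0.9, n = 11, k = 1.
adj = adjusted_levels_rss(rss_levels=1.2, d=0.9, n=11, k=1)
print(round(adj, 3))  # compare with 0.8 from the first difference equation
```

Here the adjusted figure is still above 0.8, so on this rough criterion the first difference equation would be preferred.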

All this discussion, however, assumes that there are no lagged dependent
variables among the explanatory variables. If there are lagged dependent
variables in the equation, then the estimators of the parameters are not consistent
and the above arguments do not hold.

Since we have comparable residual sums of squares, we can get the
comparable $R^2$'s as well, using the relationship $RSS = S_{yy}(1 - R^2)$:

Let

$R_1^2$ = $R^2$ from the first difference equation
$RSS_0$ = residual sum of squares from the levels equation
$RSS_1$ = residual sum of squares from the first difference equation
$R_D^2$ = comparable $R^2$ from the levels equation

Then

$$\frac{1 - R_D^2}{1 - R_1^2} = \frac{RSS_0}{RSS_1} \left( \frac{n - k - 1}{n - k}\, d \right)$$

An alternative formula by Harvey which does not contain the last term will be
presented after some illustrative examples.


Some Illustrative Examples 

Consider the simple Keynesian model discussed by Friedman and Meiselman. 3 
The equation estimated in levels is: 

$$C_t = \alpha + \beta A_t + \varepsilon_t \qquad t = 1, 2, \ldots, T$$

where $C_t$ = personal consumption expenditure (current dollars)

$A_t$ = autonomous expenditures (current dollars)

The model fitted for the 1929-1939 period gave 4:

1. $C_t = 58{,}335.9 + 2.498\,A_t$
                       (0.312)

$R^2 = 0.8771$, DW = 0.89, RSS = $11{,}943 \times 10^4$

2. $\Delta C_t = 1.993\,\Delta A_t$
                 (0.324)

$R^2 = 0.8096$, DW = 1.51, RSS = $8387 \times 10^4$

(Figures in parentheses are standard errors.) There is a reduction in the R 2 but 
the R 2 values are not comparable. The equation in first differences is better 
because of the larger DW statistic and lower residual sum of squares than for 
the equation in the levels (even after the adjustments described). 

For the production function data in Table 3.11 the first difference equation is 

$$\Delta \log X = 0.987\,\Delta \log L_1 + 0.502\,\Delta \log K_1$$
                  (0.158)               (0.134)

$$R^2 = 0.8405 \qquad DW = 1.177 \qquad RSS = 0.0278$$

The comparable figures for the levels equation reported earlier in Chapter 4, 
equation (4.24) are 

R 2 = 0.9946 DW = 0.858 RSS = 0.0434 


3 M. Friedman and D. Meiselman, "The Relative Stability of Monetary Velocity and the
Investment Multiplier in the U.S., 1897-1958," in Stabilization Policies (Commission on Money and
Credit) (Englewood Cliffs, N.J.: Prentice Hall, 1963).

4 A. C. Harvey, "On Comparing Regression Models in Levels and First Differences,"
International Economic Review, Vol. 21, No. 3, October 1980, pp. 707-720.
Again, even though the $R^2$ is larger for the equation in levels, the equation in
first differences is better than the equation in levels, because it gives a lower
RSS (even after the adjustments described) and a higher DW statistic. The
estimate of returns to scale is 0.987 + 0.502 = 1.489 in the first difference
equation compared to 1.451 + 0.384 = 1.835 in the levels equation.

We can also compute $R_D^2$, the comparable $R^2$ from the equation in levels and
see how it compares with $R_1^2$, the $R^2$ from the equation in first differences. In
the example with the Friedman-Meiselman data the value of $R_D^2$ is given by

$$R_D^2 = 1 - \frac{RSS_0}{RSS_1} \left( \frac{n - k - 1}{n - k} \right) d\,(1 - R_1^2)$$
$$= 1 - \frac{11.943}{8.387} \left( \frac{9}{10} \right) (0.89)(1 - 0.8096)$$
$$= 1 - 0.2172 = 0.7828$$

This is to be compared with $R_1^2 = 0.8096$ from the equation in first differences.
For the production function data we get

$$R_D^2 = 1 - \frac{0.0434}{0.0278} \left( \frac{36}{37} \right) (0.858)(1 - 0.8405)$$
$$= 1 - 0.2079 = 0.7921$$

This is to be compared with $R_1^2 = 0.8405$ from the equation in first differences.
Harvey 5 gives a different definition of $R_D^2$. He defines it as:

$$R_D^2 = 1 - \frac{RSS_0}{RSS_1} (1 - R_1^2)$$

This does not adjust for the fact that the error variances in the levels equation
and the first difference equation are not the same. The arguments for his
suggestion are given in his paper. In the example with the Friedman-Meiselman
data his measure of $R_D^2$ is given by

$$R_D^2 = 1 - \frac{119{,}430}{83{,}870} (1 - 0.8096) = 0.7289$$

Although $R_D^2$ cannot be greater than 1, it can be negative. This would be the
case when $\sum (\Delta y_t - \overline{\Delta y})^2 < RSS_0$, that is, when the levels model is giving a
poorer explanation than the naive model, which says that $\Delta y_t$ is a constant.
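
Both versions of the comparable R-squared can be reproduced from the reported summary statistics. The following sketch uses our own function names; the inputs are the figures quoted above for the Friedman-Meiselman and production function examples.

```python
def comparable_r2(rss0, rss1, r2_diff, d, n, k):
    """Comparable R^2 for the levels equation, including the
    ((n - k - 1) / (n - k)) * d adjustment for the error variances."""
    return 1 - (rss0 / rss1) * (n - k - 1) / (n - k) * d * (1 - r2_diff)

def harvey_r2(rss0, rss1, r2_diff):
    """Harvey's version, which omits the variance adjustment term."""
    return 1 - (rss0 / rss1) * (1 - r2_diff)

# Friedman-Meiselman data: n = 11, k = 1, DW = 0.89 for the levels equation.
fm = comparable_r2(11943, 8387, 0.8096, 0.89, 11, 1)
fm_h = harvey_r2(11943, 8387, 0.8096)

# Production function data: n = 39, k = 2, DW = 0.858 for the levels equation.
pf = comparable_r2(0.0434, 0.0278, 0.8405, 0.858, 39, 2)

print(round(fm, 4), round(fm_h, 4), round(pf, 4))
```

Only the ratio of the two residual sums of squares matters, so the common scale factor in the Friedman-Meiselman figures can be dropped.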

Usually, with time-series data, one gets high $R^2$ values if the regressions are
estimated with the levels $y_t$ and $x_t$ but one gets low $R^2$ values if the regressions
are estimated in first differences $(y_t - y_{t-1})$ and $(x_t - x_{t-1})$. Since a high $R^2$ is
usually considered as proof of a strong relationship between the variables under 
investigation, there is a strong tendency to estimate the equations in levels 
rather than in first differences. This is sometimes called the “R 2 syndrome.” 


5 Harvey, "On Comparing Regression Models," p. 711.
However, if the DW statistic is very low, it often implies a misspecified
equation, no matter what the value of the $R^2$ is. In such cases one should estimate
the regression equation in first differences and if the $R^2$ is low, this merely
indicates that the variables $y$ and $x$ are not related to each other. Granger and
Newbold 6 present some examples with artificially generated data where $y$, $x$,
and the error $u$ are each generated independently so that there is no relationship
between $y$ and $x$, but the correlations between $y_t$ and $y_{t-1}$, $x_t$ and $x_{t-1}$, and $u_t$
and $u_{t-1}$ are very high. Although there is no relationship between $y$ and $x$ the
regression of $y$ on $x$ gives a high $R^2$ but a low DW statistic. When the regression
is run in first differences, the $R^2$ is close to zero and the DW statistic is close
to 2, thus demonstrating that there is indeed no relationship between $y$ and $x$
and that the $R^2$ obtained earlier is spurious. Thus regressions in first differences
might often reveal the true nature of the relationship between $y$ and $x$. (Further
discussion of this problem is in Sections 6.10 and 14.7.)
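
An experiment in the spirit of the one just described can be run in a few lines. This is only a sketch under our own assumptions: y and x are taken to be independent random walks (one way of producing highly autocorrelated series), the seed and sample size are arbitrary, and a constant is included in both regressions so that the usual R-squared is well defined.

```python
import numpy as np

def ols_r2_dw(y, x):
    """Regress y on a constant and x; return (R^2, DW) of the fit."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    dw = np.sum(np.diff(resid) ** 2) / (resid @ resid)
    return float(r2), float(dw)

# Two independent random walks: no true relationship between y and x.
rng = np.random.default_rng(42)
n = 500
y = np.cumsum(rng.normal(size=n))
x = np.cumsum(rng.normal(size=n))

r2_lev, dw_lev = ols_r2_dw(y, x)                    # levels
r2_dif, dw_dif = ols_r2_dw(np.diff(y), np.diff(x))  # first differences

print("levels:      R2 = %.3f, DW = %.3f" % (r2_lev, dw_lev))
print("differences: R2 = %.3f, DW = %.3f" % (r2_dif, dw_dif))
```

In the levels regression the DW statistic is close to zero whatever the R-squared happens to be for a particular draw, while in first differences the R-squared is near zero and the DW statistic is near 2.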

Finally, it should be emphasized that all this discussion of the Durbin-Watson
statistic, first differences, and quasi-first differences is relevant only if we
believe that the correlation structure between the errors can be entirely
described in terms of $\rho$, the correlation coefficient between $u_t$ and $u_{t-1}$. This may
not always be the case. We will discuss some general formulations of the
correlation structure of the errors in Section 6.9 after we analyze the simple case
thoroughly. Also, even if the correlation structure can be described in terms of
just one parameter, this need not be the correlation between $u_t$ and $u_{t-1}$. For
instance, suppose that we have quarterly data; then it is possible that the errors
in any quarter this year are most highly correlated with the errors in the
corresponding quarter last year rather than the errors in the preceding quarter;
that is, $u_t$ could be uncorrelated with $u_{t-1}$ but it could be highly correlated with
$u_{t-4}$. If this is the case, the DW statistic will fail to detect it. What we should
be using is a modified statistic defined as

$$d_4 = \frac{\sum (\hat u_t - \hat u_{t-4})^2}{\sum \hat u_t^2}$$

Also, instead of using first differences or quasi first differences in the
regressions, we should be using fourth differences or quasi fourth differences, that
is, $y_t - y_{t-4}$ and $x_t - x_{t-4}$ or $y_t - \hat\rho y_{t-4}$ and $x_t - \hat\rho x_{t-4}$, where $\hat\rho$ is the
correlation coefficient between the estimated residuals $\hat u_t$ and $\hat u_{t-4}$.


6.4 Estimation Procedures with 
Autocorrelated Errors 


In Section 6.3 we considered estimation in first differences. We will now
consider estimation with quasi first differences, that is, regressing $y_t - \rho y_{t-1}$ on
$x_t - \rho x_{t-1}$. As we said earlier, we will be discussing the simplest case where

6 C. W. J. Granger and P. Newbold, "Spurious Regressions in Econometrics," Journal of
Econometrics, Vol. 2, No. 2, July 1974, pp. 111-120.
the entire correlation structure of the errors $u_t$ can be summarized in a single
parameter $\rho$. This would be the case if the errors $u_t$ are first-order
autoregressive, that is,

$$u_t = \rho u_{t-1} + e_t \qquad (6.1)$$

where $e_t$ are serially uncorrelated, with mean zero and common variance $\sigma_e^2$.
Equation (6.1) is called an autoregression because it is the usual regression
model with $u_t$ regressed on $u_{t-1}$. It is called first-order autoregression because
$u_t$ is regressed on its past with only one lag. If there are two lags, it is called
second-order autoregression. If there are three lags, it is called third-order
autoregression, and so on. If the errors $u_t$ satisfy equation (6.1), we say $u_t$ are
AR(1) (i.e., autoregressive of first order). If the errors $u_t$ satisfy the equation

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t$$

then we say that $u_t$ are AR(2), and so on.

Now we will derive $\mathrm{var}(u_t)$ and the correlations between $u_t$ and lagged values
of $u_t$. From (6.1) note that $u_t$ depends on $e_t$ and $u_{t-1}$, $u_{t-1}$ depends on $e_{t-1}$ and
$u_{t-2}$, and so on. Thus $u_t$ depends on $e_t, e_{t-1}, e_{t-2}, \ldots$. Since $e_t$ are serially
independent, and $u_{t-1}$ depends on $e_{t-1}$, $e_{t-2}$ and so on, but not $e_t$, we have

$$E(u_{t-1} e_t) = 0$$

Since $E(e_t) = 0$, we have $E(u_t) = 0$ for all $t$.

If we denote $\mathrm{var}(u_t)$ by $\sigma_u^2$, we have

$$\sigma_u^2 = \mathrm{var}(u_t) = E(u_t^2)$$
$$= E(\rho u_{t-1} + e_t)^2$$
$$= \rho^2 \sigma_u^2 + \sigma_e^2 \qquad \text{since } \mathrm{cov}(u_{t-1}, e_t) = 0$$

Thus we have

$$\sigma_u^2 = \frac{\sigma_e^2}{1 - \rho^2}$$

This gives the variance of $u_t$ in terms of variance of $e_t$ and the parameter $\rho$.

Let us now derive the correlations. Denoting the correlation between $u_t$ and
$u_{t-s}$ (which is called the correlation of lag $s$) by $\rho_s$, we get

$$E(u_t u_{t-s}) = \sigma_u^2 \rho_s$$

But

$$E(u_t u_{t-s}) = \rho E(u_{t-1} u_{t-s}) + E(e_t u_{t-s})$$

Hence

$$\sigma_u^2 \rho_s = \rho \cdot \sigma_u^2 \rho_{s-1} + 0$$

or

$$\rho_s = \rho \cdot \rho_{s-1}$$
Since $\rho_0 = 1$ we get by successive substitution

$$\rho_1 = \rho, \quad \rho_2 = \rho^2, \quad \rho_3 = \rho^3, \ldots$$

Thus the lag correlations are all powers of $\rho$ and decline geometrically.
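
These two results, the variance formula and the geometric decline of the lag correlations, can be checked against a long simulated AR(1) series. The sketch below uses assumed values (rho = 0.6, unit innovation variance, 200,000 observations); none of these come from the text.

```python
import numpy as np

rho, sigma_e = 0.6, 1.0          # assumed parameter values
rng = np.random.default_rng(7)
n = 200_000
e = rng.normal(scale=sigma_e, size=n)

# Generate u_t = rho * u_{t-1} + e_t, starting from the stationary distribution.
u = np.empty(n)
u[0] = e[0] / np.sqrt(1 - rho ** 2)
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]

var_theory = sigma_e ** 2 / (1 - rho ** 2)   # sigma_e^2 / (1 - rho^2)
var_sample = float(u.var())

def sample_autocorr(series, s):
    """Sample correlation between series_t and series_{t-s}."""
    z = series - series.mean()
    return float(np.sum(z[s:] * z[:-s]) / np.sum(z ** 2))

acf_sample = [sample_autocorr(u, s) for s in (1, 2, 3)]
acf_theory = [rho ** s for s in (1, 2, 3)]   # rho, rho^2, rho^3

print(var_sample, var_theory)
print(acf_sample, acf_theory)
```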

These expressions can be used to derive the covariance matrix of the errors,
which is then used in what is known as generalized least squares (GLS). 7 We will
not derive the expression for GLS here but will outline the minor changes that it
implies. Consider the model

$$y_t = \alpha + \beta x_t + u_t \qquad t = 1, 2, \ldots, T \qquad (6.2)$$

$$u_t = \rho u_{t-1} + e_t \qquad (6.3)$$

Except for the treatment of the first observation in the case of AR(1) errors
as in (6.3), and the treatment of the first two observations in the case of AR(2)
errors, and so on, the GLS procedure amounts to the use of transformed data,
which are obtained as follows. 8

Lagging (6.2) by one period and multiplying it by $\rho$, we get

$$\rho y_{t-1} = \alpha\rho + \beta\rho x_{t-1} + \rho u_{t-1} \qquad (6.4)$$

Subtracting (6.4) from (6.2) and using (6.3), we get

$$y_t - \rho y_{t-1} = \alpha(1 - \rho) + \beta(x_t - \rho x_{t-1}) + e_t \qquad (6.5)$$

Since $e_t$ are serially independent with a constant variance $\sigma_e^2$, we can estimate
the parameters in this equation by an OLS procedure. Equation (6.5) is often
called the quasi-difference transformation of (6.2). What we do is transform
the variables $y_t$ and $x_t$ to

$$y_t^* = y_t - \rho y_{t-1} \qquad t = 2, 3, \ldots, T \qquad (6.6)$$
$$x_t^* = x_t - \rho x_{t-1}$$

and run a regression of $y^*$ on $x^*$, with or without a constant term depending on
whether the original equation has a constant term or not. In this method we
use only $(T - 1)$ observations because we lose one observation in the process
of taking differences. This procedure is not exactly the GLS procedure. The
GLS procedure amounts to using all the $T$ observations with

$$y_1^* = \sqrt{1 - \rho^2}\; y_1 \qquad (6.6')$$
$$x_1^* = \sqrt{1 - \rho^2}\; x_1$$

and regressing $y^*$ on $x^*$ using the $T$ observations.
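
The transformations (6.6) and (6.6') can be sketched as a single helper. The function name and the `keep_first` flag are our own, and the small data vector is made up for illustration.

```python
import numpy as np

def quasi_difference(z, rho, keep_first=False):
    """Return z_t - rho * z_{t-1} for t = 2, ..., T as in (6.6).
    With keep_first=True, prepend sqrt(1 - rho^2) * z_1 as in (6.6')."""
    z = np.asarray(z, dtype=float)
    z_star = z[1:] - rho * z[:-1]
    if keep_first:
        first = np.sqrt(1 - rho ** 2) * z[0]
        z_star = np.concatenate([[first], z_star])
    return z_star

y = np.array([1.0, 2.0, 4.0, 7.0])               # illustrative values only
y_star = quasi_difference(y, rho=0.5)            # T - 1 observations
y_star_full = quasi_difference(y, rho=0.5, keep_first=True)  # all T observations

print(y_star)
print(y_star_full)
```

When all T observations are used, the column of ones for the constant term has to be transformed in the same way, so that it equals the square root of (1 - rho^2) in the first row and (1 - rho) in the remaining rows.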

In actual practice p is not known. There are two types of procedures for 
estimating p: 

1. Iterative procedures. 

2. Grid-search procedures. 


7 The derivation involves the use of matrix algebra, which we have avoided. 

8 If the number of observations is large, the omission of these initial observations does not
matter much.
Iterative Procedures

Among the iterative procedures, the earliest was the Cochrane-Orcutt
procedure. 9 In the Cochrane-Orcutt procedure we estimate equation (6.2) by OLS,
get the estimated residuals $\hat u_t$, and estimate $\rho$ by $\hat\rho = \sum \hat u_t \hat u_{t-1} / \sum \hat u_t^2$.

Durbin 10 suggested an alternative method of estimating $\rho$. In this procedure,
we write equation (6.5) as

$$y_t = \alpha(1 - \rho) + \rho y_{t-1} + \beta x_t - \beta\rho x_{t-1} + e_t \qquad (6.7)$$

We regress $y_t$ on $y_{t-1}$, $x_t$, and $x_{t-1}$ and take the estimated coefficient of $y_{t-1}$ as
an estimate of $\rho$.

Once an estimate of $\rho$ is obtained, we construct the transformed variables $y^*$
and $x^*$ as defined in (6.6) and (6.6') and estimate a regression of $y^*$ on $x^*$. The
only thing to note is that the slope coefficient in this equation is $\beta$, but the
intercept is $\alpha(1 - \rho)$. Thus after estimating the regression of $y^*$ on $x^*$, we have
to adjust the constant term appropriately to get estimates of the parameters of
the original equation (6.2). Further, the standard errors we compute from the
regression of $y^*$ on $x^*$ are now "asymptotic" standard errors because of the
fact that $\rho$ has been estimated. If there are lagged values of $y$ as explanatory
variables, these standard errors are not correct even asymptotically. The
adjustment needed in this case is discussed in Section 6.7.

If there are many explanatory variables in the equation, Durbin's method
involves a regression in too many variables (twice the number of explanatory
variables plus $y_{t-1}$). Hence it is often customary to prefer the Cochrane-Orcutt
procedure, iterated until it converges. However, there are examples 11 to show that the
minimization of $\sum e_t^2$ in (6.5) can produce multiple solutions for $\rho$. In this case
the Cochrane-Orcutt procedure, which relies on a single solution for $\rho$, might
give a local minimum, and even when iterated might converge to a local
minimum. Hence it is better to use a grid-search procedure, which we will now
describe.
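
Before turning to the grid search, the iterative Cochrane-Orcutt procedure for model (6.2) can be sketched as follows. This is a minimal illustration with one explanatory variable, not a definitive implementation: the function names, the convergence tolerance, and the simulated data (alpha = 1, beta = 2, rho = 0.7) are all our own assumptions, and the first observation is dropped as in (6.6).

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients for y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cochrane_orcutt(y, x, tol=1e-8, max_iter=100):
    """Iterate: residuals from (6.2) -> rho_hat -> OLS on (6.5) -> new residuals."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    alpha, beta = ols(np.column_stack([np.ones(len(x)), x]), y)
    rho = 0.0
    for _ in range(max_iter):
        u = y - alpha - beta * x                       # residuals of original equation
        rho_new = np.sum(u[1:] * u[:-1]) / np.sum(u ** 2)
        y_s = y[1:] - rho_new * y[:-1]                 # quasi differences (6.6)
        x_s = x[1:] - rho_new * x[:-1]
        c, beta = ols(np.column_stack([np.ones(len(x_s)), x_s]), y_s)
        alpha = c / (1 - rho_new)                      # intercept of (6.5) is alpha(1 - rho)
        converged = abs(rho_new - rho) < tol
        rho = rho_new
        if converged:
            break
    return float(alpha), float(beta), float(rho)

# Simulated data with AR(1) errors (assumed true values: 1.0, 2.0, 0.7).
rng = np.random.default_rng(3)
n, rho_true = 2000, 0.7
x = rng.normal(size=n)
e = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho_true * u[t - 1] + e[t]
y = 1.0 + 2.0 * x + u

alpha_hat, beta_hat, rho_est = cochrane_orcutt(y, x)
print(alpha_hat, beta_hat, rho_est)
```

As the text warns, an iteration of this kind only finds one stationary point of the sum of squares, so it offers no protection against a local minimum.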


Grid-Search Procedures 

One of the first grid-search procedures is the Hildreth and Lu procedure 12
suggested in 1960. The procedure is as follows. Calculate $y_t^*$ and $x_t^*$ in (6.6) for
different values of $\rho$ at intervals of 0.1 in the range $-1 \le \rho \le 1$. Estimate the
regression of $y_t^*$ on $x_t^*$ and calculate the residual sum of squares RSS in each

9 D. Cochrane and G. H. Orcutt, "Application of Least Squares Regressions to Relationships
Containing Autocorrelated Error Terms," Journal of the American Statistical Association,
1949, pp. 32-61.

10 J. Durbin, "Estimation of Parameters in Time Series Regression Models," Journal of the
Royal Statistical Society, Series B, 1960, pp. 139-153.

11 J. M. Dufour, M. J. I. Gaudry, and T. C. Lieu, "The Cochrane-Orcutt Procedure: Numerical
Examples of Multiple Admissible Minima," Economics Letters, 1980 (6), pp. 43-48.

12 Clifford Hildreth and John Y. Lu, Demand Relations with Autocorrelated Disturbances, AES
Technical Bulletin 276, Michigan State University, East Lansing, Mich., November 1960.
case. Choose the value of $\rho$ for which the RSS is minimum. Again repeat this
procedure for smaller intervals of $\rho$ around this value. For instance, if the value
of $\rho$ for which RSS is minimum is $-0.4$, repeat this search procedure for values
of $\rho$ at intervals of 0.01 in the range $-0.5 < \rho < -0.3$.

This procedure is not the same as the maximum likelihood (ML) procedure.
If the errors $e_t$ are normally distributed, we can write the log-likelihood
function as (derivation is omitted)

$$\log L = \text{const.} - \frac{T}{2} \log \sigma_e^2 + \frac{1}{2} \log (1 - \rho^2) - \frac{Q}{2\sigma_e^2} \qquad (6.8)$$

where

$$Q = \sum \left[ y_t - \rho y_{t-1} - \alpha(1 - \rho) - \beta(x_t - \rho x_{t-1}) \right]^2$$

Thus minimizing $Q$ is not the same as maximizing $\log L$. We can use the
grid-search procedure to get the ML estimates. The only difference is that after we
compute the residual sum of squares RSS($\rho$) for each $\rho$, we choose the value
of $\rho$ for which $(T/2) \log \mathrm{RSS}(\rho) - (1/2) \log (1 - \rho^2)$ is minimum. If the number
of observations is large, the latter term will be small compared to the former,
and the ML procedure and the Hildreth-Lu procedure will give almost the
same results.
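
The two-stage grid search, with either the RSS criterion or the ML criterion, can be sketched as follows. This is an illustration under our own assumptions: the function names are hypothetical, the grid stops at plus or minus 0.9 in the first stage (the ML criterion is undefined at the end points), RSS is computed from the T - 1 quasi-differenced observations in (6.6), and the data are simulated with rho = 0.7.

```python
import numpy as np

def rss_given_rho(y, x, rho):
    """RSS from regressing y* on a constant and x*, with y*, x* as in (6.6)."""
    y_s = y[1:] - rho * y[:-1]
    x_s = x[1:] - rho * x[:-1]
    X = np.column_stack([np.ones(len(x_s)), x_s])
    beta, *_ = np.linalg.lstsq(X, y_s, rcond=None)
    resid = y_s - X @ beta
    return float(resid @ resid)

def grid_search(y, x, criterion="rss"):
    """Coarse grid (step 0.1), then a fine grid (step 0.01) around the best value."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    T = len(y)

    def objective(rho):
        rss = rss_given_rho(y, x, rho)
        if criterion == "ml":
            return (T / 2) * np.log(rss) - 0.5 * np.log(1 - rho ** 2)
        return rss

    coarse = np.round(np.arange(-0.9, 0.91, 0.1), 10)
    best = min(coarse, key=objective)
    fine = np.round(np.arange(best - 0.1, best + 0.1001, 0.01), 10)
    fine = fine[np.abs(fine) < 1]        # keep the search inside (-1, 1)
    return float(min(fine, key=objective))

# Simulated data with AR(1) errors (assumed true rho = 0.7).
rng = np.random.default_rng(5)
n, rho_true = 2000, 0.7
x = rng.normal(size=n)
e = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho_true * u[t - 1] + e[t]
y = 1.0 + 2.0 * x + u

rho_hl = grid_search(y, x, criterion="rss")   # Hildreth-Lu
rho_ml = grid_search(y, x, criterion="ml")    # ML criterion
print(rho_hl, rho_ml)
```

With a sample this large the two criteria select essentially the same value of rho, in line with the remark above.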


Illustrative Example 

Consider the data in Table 3.11 and the estimation of the production function

$$\log X = \alpha + \beta_1 \log L_1 + \beta_2 \log K_1 + u$$

The OLS estimation gave a DW statistic of 0.86, suggesting significant
positive autocorrelation. Assuming that the errors were AR(1), two estimation
procedures were used: the Hildreth-Lu grid search and the iterative
Cochrane-Orcutt (C-O). The other procedures we have described can also be tried, but
this is left as an exercise.

The Hildreth-Lu procedure gave $\hat\rho = 0.77$. The iterative C-O procedure
gave $\hat\rho = 0.80$. The DW test statistic implies that $\hat\rho = (1/2)(2 - 0.86) = 0.57$.

The estimates of the parameters (with the standard errors in parentheses)
were as follows:

Estimate of     OLS          Hildreth-Lu     Iterative C-O

alpha           -3.938       -2.909          -2.737
                (0.237)      (0.462)         (0.461)

beta_1          1.451        1.092           1.070
                (0.083)      (0.151)         (0.153)

beta_2          0.384        0.570           0.558
                (0.048)      (0.104)         (0.097)

RSS             0.04338      0.02635         0.02644


In this example the parameter estimates given by Hildreth-Lu and the iter¬ 
ative C-O procedures are pretty close to each other. Correcting for the auto- 
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correlation in the errors has resulted in a significant change in the estimates of 
3, and p 2 . 


6.5 Effect of AR(1) Errors on OLS Estimates


In Section 6.4 we described different procedures for the estimation of
regression models with AR(1) errors. We will now answer two questions that might
arise with the use of these procedures:

1. What do we gain from using these procedures? 

2. When should we not use these procedures? 

First, in the case we are considering (i.e., the case where the explanatory
variable $x_t$ is independent of the error $u_t$), the ordinary least squares
(OLS) estimates are unbiased. However, they will not be efficient. Further, the
tests of significance we apply, which will be based on the wrong covariance
matrix, will be wrong. In the case where the explanatory variables include lagged
dependent variables, we will have some further problems, which we discuss in
Section 6.7. For the present, let us consider the simple regression model

$$y_t = \beta x_t + u_t \tag{6.9}$$

Let $\operatorname{var}(u_t) = \sigma^2$ and
$\operatorname{cov}(u_t, u_{t-j}) = \rho_j\sigma^2$. If $u_t$ are AR(1), we have
$\rho_j = \rho^j$.

The OLS estimator of $\beta$ is

$$\hat\beta = \frac{\sum x_t y_t}{\sum x_t^2}$$

Hence

$$\hat\beta - \beta = \frac{\sum x_t u_t}{\sum x_t^2} \quad\text{and}\quad E(\hat\beta - \beta) = 0$$

$$V(\hat\beta) = \frac{1}{\left(\sum x_t^2\right)^2}\operatorname{var}\left(\sum x_t u_t\right)
= \frac{\sigma^2}{\left(\sum x_t^2\right)^2}\left(\sum x_t^2 + 2\rho\sum x_t x_{t-1} + 2\rho^2\sum x_t x_{t-2} + \cdots\right)$$

since $\operatorname{cov}(u_t, u_{t-j}) = \rho^j\sigma^2$. Thus we have

$$V(\hat\beta) = \frac{\sigma^2}{\sum x_t^2}\left(1 + 2\rho\frac{\sum x_t x_{t-1}}{\sum x_t^2} + 2\rho^2\frac{\sum x_t x_{t-2}}{\sum x_t^2} + \cdots\right) \tag{6.10}$$

If we ignore the autocorrelation problem, we would be computing $V(\hat\beta)$
as $\sigma^2/\sum x_t^2$. Thus we would be ignoring the expression in the
parentheses of (6.10). To get an idea of the magnitude of this expression, let us
assume that the $x_t$ series also follows an AR(1) process with
$\operatorname{var}(x_t) = \sigma_x^2$ and $\operatorname{cor}(x_t, x_{t-1}) = r$.
Since we are now assuming $x_t$ to be stochastic, we will consider the asymptotic
variance of $\hat\beta$. The expression in parentheses in (6.10) is now

$$1 + 2\rho r + 2\rho^2 r^2 + \cdots = 1 + \frac{2\rho r}{1 - \rho r} = \frac{1 + \rho r}{1 - \rho r}$$

Thus

$$V(\hat\beta) = \frac{\sigma^2}{T\sigma_x^2}\cdot\frac{1 + r\rho}{1 - r\rho}$$

where $T$ is the number of observations. If $r = \rho = 0.8$, then

$$\frac{1 + r\rho}{1 - r\rho} = \frac{1.64}{0.36} \approx 4.56$$

Thus ignoring the expression in the parentheses of (6.10) results in an
underestimation by close to 78% for the variance of $\hat\beta$ (the computed
variance is only $0.36/1.64 \approx 0.22$ of the correct one).

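One further error is also involved: we use $\sum \hat u_t^2/(T - 1)$ as an
estimate of $\sigma^2$. If $\rho = 0$, this is an unbiased estimate of
$\sigma^2$. If $\rho \neq 0$, then under the assumptions we are making, we have
approximately¹³

$$E\left(\frac{\sum \hat u_t^2}{T - 1}\right) = \frac{\sigma^2}{T - 1}\left(T - \frac{1 + r\rho}{1 - r\rho}\right)$$

Again if $\rho = r = 0.8$ and $T = 20$, we have

$$E\left(\frac{\sum \hat u_t^2}{T - 1}\right) = \frac{15.44}{19}\,\sigma^2 \approx 0.81\sigma^2$$

Thus there is a further underestimation of 19%. Together, these two effects
result in an underestimation of the sampling variance of $\hat\beta$ of more
than 80% (the computed variance is about $0.22 \times 0.81 \approx 0.18$ of the
correct one), which corresponds to an understatement of the standard error of
close to 60%.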

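¹³ Note that $\hat u_t = u_t - x_t(\hat\beta - \beta)$. Hence
$E(\sum \hat u_t^2) = E(\sum u_t^2) + E[(\hat\beta - \beta)^2\sum x_t^2] - 2E[(\hat\beta - \beta)\sum x_t u_t]$.
The first term is $T\sigma^2$. The second term is
$\sigma^2(1 + r\rho)/(1 - r\rho)$. The last term is
$-2\sigma^2(1 + r\rho)/(1 - r\rho)$. In all this note that we take probability
limits rather than expectations, since these results are all asymptotic.

We can also derive the asymptotic variance of the ML estimator $\tilde\beta$ when
both $x$ and $u$ are first-order autoregressive as follows. Note that the ML
estimator of $\beta$ is asymptotically equivalent to the estimator obtained from
a regression of $(y_t - \rho y_{t-1})$ on $(x_t - \rho x_{t-1})$. Hence

$$V(\tilde\beta) = \operatorname{var}\left[\frac{\sum (x_t - \rho x_{t-1})(u_t - \rho u_{t-1})}{\sum (x_t - \rho x_{t-1})^2}\right]
= \operatorname{var}\left[\frac{\sum (x_t - \rho x_{t-1})e_t}{\sum (x_t - \rho x_{t-1})^2}\right]
= \frac{\sigma_e^2}{\sum (x_t - \rho x_{t-1})^2}$$

where $\sigma_e^2 = \operatorname{var}(e_t)$. When $x_t$ is autoregressive we have

$$\operatorname{plim}\frac{1}{T}\sum (x_t - \rho x_{t-1})^2 = \sigma_x^2(1 + \rho^2 - 2r\rho)$$

Also,

$$\operatorname{var}(u_t) = \sigma^2 = \frac{\sigma_e^2}{1 - \rho^2} \quad\text{or}\quad \sigma_e^2 = \sigma^2(1 - \rho^2)$$

Hence by substitution we get the asymptotic variance of $\tilde\beta$ as

$$V(\tilde\beta) = \frac{\sigma^2}{T\sigma_x^2}\cdot\frac{1 - \rho^2}{1 + \rho^2 - 2r\rho}$$

Thus the efficiency of the OLS estimator is

$$\frac{V(\tilde\beta)}{V(\hat\beta)} = \frac{1 - r\rho}{1 + r\rho}\cdot\frac{1 - \rho^2}{1 + \rho^2 - 2r\rho}$$

One can compute this for different values of $r$ and $\rho$. For
$r = \rho = 0.8$ this efficiency is about 0.22.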
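The magnitudes quoted in this section are easy to tabulate for other values of r and ρ. The sketch below simply codes the asymptotic expressions derived above; the function names are hypothetical and chosen only for illustration.

```python
def ols_variance_inflation(r, rho):
    """Factor (1 + r*rho)/(1 - r*rho) multiplying sigma^2/(T*sigma_x^2) in V(beta_hat)."""
    return (1 + r * rho) / (1 - r * rho)


def sigma2_estimate_factor(r, rho, T):
    """Approximate E[sum(u_hat^2)/(T - 1)] / sigma^2 under AR(1) errors and AR(1) x."""
    return (T - ols_variance_inflation(r, rho)) / (T - 1)


def ols_efficiency(r, rho):
    """Asymptotic V(ML estimator) / V(OLS estimator)."""
    return (1 - r * rho) / (1 + r * rho) * (1 - rho ** 2) / (1 + rho ** 2 - 2 * r * rho)


r, rho, T = 0.8, 0.8, 20
inflation = ols_variance_inflation(r, rho)      # about 4.56
understatement = 1 - 1 / inflation              # about 0.78
s2_factor = sigma2_estimate_factor(r, rho, T)   # about 0.81
efficiency = ols_efficiency(r, rho)             # about 0.22
print(inflation, understatement, s2_factor, efficiency)
```

Evaluating `ols_efficiency` over a grid of r and ρ shows how quickly the loss grows as both series become more persistent.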

Thus the consequences of autocorrelated errors are:


1. The least squares estimators are unbiased but are not efficient. Sometimes
they are considerably less efficient than the procedures that take account of
the autocorrelation.

2. The sampling variances are biased and sometimes likely to be seriously
understated. Thus $R^2$ as well as $t$ and $F$ statistics tend to be exaggerated.

The solution to these problems is to use the maximum likelihood procedure
or some other procedure mentioned earlier that takes account of the
autocorrelation. However, there are four important points to note:

1. If $\rho$ is known, it is true that one can get estimators better than OLS
that take account of autocorrelation. However, in practice $\rho$ is not known
and has to be estimated. In small samples it is not necessarily true that one
gains (in terms of mean-square error for $\hat\beta$) by estimating $\rho$. This
problem has been investigated by Rao and Griliches,¹⁴ who suggest the rule of
thumb (for samples of size 20) that one can use the methods that take account of
autocorrelation if $|\hat\rho| \geq 0.3$, where $\hat\rho$ is the estimated
first-order serial correlation from an OLS regression.¹⁵ In samples of larger
sizes it would be worthwhile using these methods for $\hat\rho$ smaller than 0.3.

¹⁴ P. Rao and Z. Griliches, “Some Small Sample Properties of Several Two-Stage
Regression Methods in the Context of Autocorrelated Errors,” Journal of the
American Statistical Association, March 1969.

2. The discussion above assumes that the true errors are first-order
autoregressive. If they have a more complicated structure (e.g., second-order
autoregressive), it might be thought that it would still be better to proceed
on the assumption that the errors are first-order autoregressive rather
than ignore the problem completely and use the OLS method. Engle¹⁶
shows that this is not necessarily true (i.e., sometimes one can be worse
off making the assumption of first-order autocorrelation than ignoring the
problem completely).

3. In regressions with quarterly (or monthly) data, one might find that the
errors exhibit fourth (or twelfth)-order autocorrelation because of not
making adequate allowance for seasonal effects. In such cases if one
looks for only first-order autocorrelation, one might not find any. This
does not mean that autocorrelation is not a problem. In this case the
appropriate specification for the error term may be
$u_t = \rho u_{t-4} + e_t$ for quarterly data and
$u_t = \rho u_{t-12} + e_t$ for monthly data.

4. Finally, and most important, it is often possible to confuse misspecified
dynamics with serial correlation in the errors. For instance, a static
regression model with first-order autocorrelation in the errors, that is,
$y_t = \beta x_t + u_t$, $u_t = \rho u_{t-1} + e_t$, can be written as

$$y_t = \rho y_{t-1} + \beta x_t - \beta\rho x_{t-1} + e_t \tag{6.11}$$

This model is the same as

$$y_t = \alpha_1 y_{t-1} + \alpha_2 x_t + \alpha_3 x_{t-1} + e_t \tag{6.11'}$$

with the restriction $\alpha_1\alpha_2 + \alpha_3 = 0$. We can estimate the
model (6.11') and test this restriction. If it is rejected, clearly it is not
valid to estimate (6.11). (The test procedure is described in Section 6.8.)

If the restriction does not hold and we nevertheless estimate the static model,
the errors would be serially correlated, not because they follow a first-order
autoregressive process but because the terms $x_{t-1}$ and $y_{t-1}$ have been
omitted. This is what is meant by “misspecified dynamics.” Thus a significant
serial correlation in the estimated residuals does not necessarily imply that we
should estimate a serial correlation model. Some further tests are necessary
(like the restriction $\alpha_1\alpha_2 + \alpha_3 = 0$ in the above-mentioned
case). In fact, it is always best to start with an equation like (6.11') and
test this restriction before applying any tests for serial correlation.

¹⁵ Of course, it is not sufficient to argue in favor of OLS on the basis of
mean-square errors of the estimators alone. What is also relevant is how
seriously the sampling variances are biased.

¹⁶ Robert F. Engle, “Specification of the Disturbance for Efficient Estimation,”
Econometrica, 1973.


6.6 Some Further Comments on the DW Test


In Section 6.2 we discussed the Durbin-Watson test for first-order
autocorrelation, which is based on least-squares residuals. There are two other
tests that are also commonly used to test first-order autocorrelation. These are

1. The von Neumann ratio. 

2. The Berenblut-Webb test. 

We will briefly describe what they are: 


The von Neumann Ratio 

The von Neumann ratio¹⁷ is defined as

$$\frac{\delta^2}{s^2} = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2/(n - 1)}{\sum_{t=1}^{n}(e_t - \bar e)^2/n}$$

where $e_t$ are the residuals. The von Neumann ratio can be used only when $e_t$
are independent (under the null hypothesis) and have a common variance. The
least squares residuals $\hat u_t$ do not satisfy these conditions and hence one
cannot use the von Neumann ratio with least squares residuals.

In recent years a large number of alternative residuals have been suggested for
the linear regression model. Many of these residuals, particularly the
“recursive residuals,” are independent and have a common variance. These
different types of residuals are useful for diagnostic checking of the
regression model and are discussed in Chapter 12. The recursive residuals, in
particular, can easily be computed. Since they are independent and have a common
variance, one can use them to compute the von Neumann ratio, as suggested by
Phillips and Harvey.¹⁸

For large samples $\delta^2/s^2$ can be taken as normally distributed with mean
and variance given by

$$E\left(\frac{\delta^2}{s^2}\right) = \frac{2n}{n - 1}, \qquad
V\left(\frac{\delta^2}{s^2}\right) = \frac{4n^2(n - 2)}{(n + 1)(n - 1)^3}$$

For finite samples one can use the tables prepared by G. I. Hart, published in
Annals of Mathematical Statistics, 1962, pp. 207-214.

¹⁷ J. von Neumann, “Distribution of the Ratio of the Mean Square Successive
Difference to the Variance,” Annals of Mathematical Statistics, 1941,
pp. 367-395.

¹⁸ G. D. A. Phillips and A. C. Harvey, “A Simple Test for Serial Correlation in
Regression Analysis,” Journal of the American Statistical Association,
December 1974, pp. 935-939.
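As a rough illustration of the large-sample version of the test, the ratio and its normal approximation can be computed as below. The function names are hypothetical, and the input is assumed to be a list of residuals that are independent with a common variance under the null (for example recursive residuals), not ordinary least squares residuals.

```python
import math


def von_neumann_ratio(e):
    """Mean square successive difference divided by the variance of e."""
    n = len(e)
    ebar = sum(e) / n
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, n)) / (n - 1)
    den = sum((v - ebar) ** 2 for v in e) / n
    return num / den


def von_neumann_z(e):
    """Standardized ratio using the large-sample mean and variance from the text."""
    n = len(e)
    mean = 2 * n / (n - 1)
    var = 4 * n ** 2 * (n - 2) / ((n + 1) * (n - 1) ** 3)
    return (von_neumann_ratio(e) - mean) / math.sqrt(var)


# Toy input: a perfectly alternating series has a ratio well above 2n/(n - 1)
ratio = von_neumann_ratio([1.0, -1.0, 1.0, -1.0])
z = von_neumann_z([1.0, -1.0, 1.0, -1.0])
print(ratio, z)
```

Values of the ratio well below 2n/(n − 1) point to positive serial correlation, and values well above it to negative serial correlation.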

There are many other residuals suggested in the literature for the purpose of
testing serial correlation. These are the Durbin residuals, Sims’ residuals, and
so on. But all these are more complicated to compute. The recursive residuals,
which are useful for analysis of stability of the regression relationships and
are easy to compute, can be used for tests for serial correlation in case the
Durbin-Watson test is inconclusive.


The Berenblut-Webb Test

The Berenblut-Webb test¹⁹ is based on the statistic

$$g = \frac{\sum_{t=2}^{n}\hat e_t^2}{\sum_{t=1}^{n}\hat u_t^2}$$

where $\hat e_t$ are the estimated residuals from a regression of first
differences of $y$ on first differences of the explanatory variables (with no
constant term) and $\hat u_t$ are the usual OLS residuals. If the original
equation contains a constant term, we can use the Durbin-Watson tables on bounds
with the g-statistic. The g-statistic is useful when values of $|\rho| \geq 1$
are possible.

¹⁹ I. I. Berenblut and G. I. Webb, “A New Test for Autocorrelated Errors in the
Linear Regression Model,” Journal of the Royal Statistical Society, Series B,
Vol. 35, 1973, pp. 33-50.

The literature on the DW test and the problem of testing for autocorrelation
is enormous. We will summarize a few of the important conclusions:

A. Since the DW statistic is printed out by almost all computer programs, and
the tables for its use are readily available, one should use this test with
least squares residuals. However, with most economic data it is better to use
the upper bound as the true significance point (i.e., treat the inconclusive
region as a rejection region). For instance, with $n = 25$ and the number of
explanatory variables 4, we have $d_L = 1.04$ and $d_U = 1.77$ as the 5% level
significance points. Thus if the computed DW statistic is $d = 1.5$, we would
normally say that the test is inconclusive at the 5% level. Treating $d_U$ as
the 5% significance point, we would reject the null hypothesis $\rho = 0$ at the
5% level. If more accuracy is required when $d$ is in the inconclusive region,
there are a number of alternatives suggested but all are computationally
burdensome. The whole idea of testing for serial correlation is that if we do
not reject the hypothesis $\rho = 0$, we can stay with OLS and avoid excessive
computational burden. Thus trying to use all these other tests is more
burdensome than estimating the model assuming $\rho \neq 0$. If we generate the
recursive residuals for some other purposes, we can apply the von Neumann ratio
test using these residuals. Also, if we are estimating the first difference
equation, we can use the Berenblut-Webb test as well, without any extra
computational burden.
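The decision rule in conclusion A can be sketched as follows. The function names and the `treat_inconclusive_as_reject` flag are hypothetical; the bounds must still be looked up in the DW tables, and the sketch covers only the test against positive autocorrelation.

```python
def durbin_watson(u):
    """DW statistic computed from a list of least squares residuals."""
    num = sum((u[t] - u[t - 1]) ** 2 for t in range(1, len(u)))
    den = sum(v * v for v in u)
    return num / den


def dw_decision(d, d_lower, d_upper, treat_inconclusive_as_reject=True):
    """Test of rho = 0 against rho > 0 using tabulated bounds d_lower and d_upper."""
    if d < d_lower:
        return "reject"
    if d < d_upper:
        # Inconclusive region: the text recommends treating d_upper as the
        # significance point for most economic data.
        return "reject" if treat_inconclusive_as_reject else "inconclusive"
    return "do not reject"


# Example from the text: n = 25, four explanatory variables, d = 1.5
conservative = dw_decision(1.5, 1.04, 1.77)
conventional = dw_decision(1.5, 1.04, 1.77, treat_inconclusive_as_reject=False)
print(conservative, conventional)
```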

B. There are many tables other than those reprinted at the end of this book 
for the DW test that have been prepared to take care of special situations. Some 
of these are: 

1. R. W. Farebrother in Econometrica, Vol. 48, September 1980, pp. 1553-1563,
gives tables for regression models with no intercept term.

2. N. E. Savin and K. J. White, in Econometrica, Vol. 45, No. 8, November
1977, pp. 1989-1996, present tables for the DW test for samples with 6 to
200 observations and for as many as 20 regressors.

3. K. F. Wallis in Econometrica, Vol. 40, 1972, pp. 617-636, gives tables for
regression models with quarterly data. Here one would like to test for
fourth-order autocorrelation rather than first-order autocorrelation. In
this case the DW statistic is

$$d_4 = \frac{\sum_{t=5}^{n}(\hat u_t - \hat u_{t-4})^2}{\sum_{t=1}^{n}\hat u_t^2}$$

Wallis provides 5% critical values $d_L$ and $d_U$ for two situations: one where
the $k$ regressors include an intercept (but not a full set of seasonal dummy
variables) and another where the regressors include four quarterly seasonal
dummy variables. In each case the critical values are for testing
$H_0$: $\rho = 0$ against $H_1$: $\rho > 0$. For the hypothesis $H_1$:
$\rho < 0$ Wallis suggests that the appropriate critical values are
$(4 - d_U)$ and $(4 - d_L)$. M. L. King and D. E. A. Giles in Journal of
Econometrics, Vol. 8, 1978, pp. 255-260, give further significance points for
Wallis’s test.

4. M. L. King in Econometrica, Vol. 49, November 1981, pp. 1571-1581,
gives the 5% points for $d_L$ and $d_U$ for quarterly time-series data with
trend and/or seasonal dummy variables. These tables are for testing first-order
autocorrelation.

5. M. L. King in Journal of Econometrics, Vol. 21, 1983, pp. 357-366, gives
tables for the DW test for monthly data. In the case of monthly data we
would want to test for twelfth-order autocorrelation.

C. All the elaborate tables mentioned in B have been prepared for the 5% level
of significance (and the 1% level of significance), and a question arises as to
what the appropriate level of significance is for the DW test. Given that the
test for serial correlation is a prelude to further estimation and not an end in
itself, the theory of pretest estimation suggests that a significance level of,
say, 0.35 or 0.4 is more appropriate than the conventional 0.05 significance
level.²⁰

²⁰ See, for instance, T. B. Fomby and D. K. Guilkey, “On Choosing the Optimal
Level of Significance for the Durbin-Watson Test and a Bayesian Alternative,”
Journal of Econometrics, Vol. 8, 1978, pp. 203-214.




D. A significant DW statistic can arise from many different sources. The
DW statistic can detect moving average errors, AR(2) errors, or just the effect
of omitted variables that are themselves autocorrelated. This raises the
question of what the appropriate strategy should be when the DW statistic is
significant. It does not necessarily imply that the errors are AR(1). One can
proceed in different directions. The different strategies are discussed in
Section 6.9.

Finally, the DW test is not applicable if the explanatory variables contain
lagged dependent variables. The appropriate tests are discussed in the next
section.


6.7 Tests for Serial Correlation in Models 
with Lagged Dependent Variables 

In previous sections we considered explanatory variables that were uncorrelated
with the error term. This will not be the case if we have lagged dependent
variables among the explanatory variables and we have serially correlated
errors. There are several situations under which we would be considering lagged
dependent variables as explanatory variables. These could arise through
expectations, adjustment lags, and so on. The various situations and models are
explained in Chapter 10. For the present we will not be concerned with how
the models arise. We will merely study the problem of testing for
autocorrelation in these models.

Let us consider a simple model

$$y_t = \alpha y_{t-1} + \beta x_t + u_t \tag{6.12}$$

$$u_t = \rho u_{t-1} + e_t \tag{6.13}$$

where $e_t$ are independent with mean 0 and variance $\sigma^2$, and
$|\alpha| < 1$, $|\rho| < 1$. Because $u_t$ depends on $u_{t-1}$ and $y_{t-1}$
depends on $u_{t-1}$, the two variables $y_{t-1}$ and $u_t$ will be correlated.
The least squares estimator $\hat\alpha$ will be inconsistent. It can be shown
that²¹

$$\operatorname{plim}\hat\alpha = \alpha + A \quad\text{and}\quad \operatorname{plim}\hat\rho = \rho - A$$

where

$$A = \frac{\rho\,\sigma_x^2\sigma_u^2}{(1 - \alpha\rho)D}$$

$$D = \operatorname{var}(y_{-1})\operatorname{var}(x) - \operatorname{cov}^2(y_{-1}, x) > 0$$

$$\sigma_x^2 = \operatorname{var}(x), \qquad \sigma_u^2 = \operatorname{var}(u)$$

Thus if $\rho$ is positive, the estimate of $\alpha$ is biased upward and the
estimate of $\rho$ is biased downward. Hence the DW statistic, which is
approximately $2(1 - \hat\rho)$, is biased toward 2 and we would not find any
significant serial correlation even if the errors are serially correlated.

²¹ The proofs are somewhat long and are omitted. For a first-order
autoregressive $x_t$ they can be found in G. S. Maddala and A. S. Rao, “Tests
for Serial Correlation in Regression Models with Lagged Dependent Variables and
Serially Correlated Errors,” Econometrica, Vol. 41, No. 4, July 1973, App. A,
pp. 761-774.

Durbin’s h-Test

Since the DW test is not applicable in these models, Durbin suggests an
alternative test, called the h-test.²² This test uses

$$h = \hat\rho\sqrt{\frac{n}{1 - n\hat V(\hat\alpha)}}$$

as a standard normal variable. Here $\hat\rho$ is the estimated first-order
serial correlation from the OLS residuals, $\hat V(\hat\alpha)$ is the estimated
variance of the OLS estimate of $\alpha$, and $n$ is the sample size. If
$n\hat V(\hat\alpha) > 1$, the test is not applicable. In this case Durbin
suggests the following test.

Durbin’s Alternative Test

From the OLS estimation of (6.12) compute the residuals $\hat u_t$. Then regress
$\hat u_t$ on $\hat u_{t-1}$, $y_{t-1}$, and $x_t$. The test for $\rho = 0$ is
carried out by testing the significance of the coefficient of $\hat u_{t-1}$ in
the latter regression.

Illustrative Example

An equation of demand for food estimated from 50 observations gave the
following results.

$$\log q_t = \text{const.} - \underset{(0.05)}{0.31}\log p_t + \underset{(0.20)}{0.45}\log y_t + \underset{(0.14)}{0.65}\log q_{t-1}$$

$R^2 = 0.90$    DW = 1.8
(Figures in parentheses are standard errors.)
$q_t$ = food consumption per capita
$p_t$ = food price (retail price deflated by the Consumer Price Index)
$y_t$ = per capita disposable income deflated by the Consumer Price Index

²² J. Durbin, “Testing for Serial Correlation in Least Squares Regression When
Some of the Regressors Are Lagged Dependent Variables,” Econometrica, 1970,
pp. 410-421. Durbin’s paper was prompted by a note by Nerlove and Wallis that
argued that the DW statistic is not applicable when lagged dependent variables
are present. See M. Nerlove and K. F. Wallis, “Use of DW Statistic in
Inappropriate Situations,” Econometrica, Vol. 34, 1966, pp. 235-238.

We have

$\hat\alpha = 0.65$,  $\hat V(\hat\alpha) = (0.14)^2 = 0.0196$
$\hat\rho = 0.1$ since DW $\approx 2(1 - \hat\rho)$

Hence Durbin’s h-statistic is

$$h = 0.1\sqrt{\frac{50}{1 - 50(0.0196)}} = 5.0$$

This is significant at the 1% level. Thus we reject the hypothesis $\rho = 0$,
even though the DW statistic is close to 2 and the estimate $\hat\rho$ from the
OLS residuals is only 0.1.

Let us keep all the numbers the same and just change the standard error of
$\hat\alpha$. The following are the results:

SE($\hat\alpha$)   $\hat V(\hat\alpha)$   $1 - n\hat V(\hat\alpha)$   h      Conclusion

0.13               0.0169                 0.155                       1.80   Not significant at the 5% level
0.15               0.0225                 -0.125                             Test not defined


Thus, other things equal, the precision with which $\alpha$ is estimated has a
significant effect on the outcome of the h-test.

In the case where the h-test cannot be used, we can use the alternative test
suggested by Durbin. However, the Monte Carlo study by Maddala and Rao²³
suggests that this test does not have good power in those cases where the
h-test cannot be used. On the other hand, in cases where the h-test can be used,
Durbin’s second test is almost as powerful. It is not often used because it
involves more computations. However, we will show that Durbin’s second test
can be generalized to higher-order autoregressions, whereas the h-test cannot
be.
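The arithmetic of the h-test in this example can be reproduced with a few lines. This is a sketch only; the function names are hypothetical, and returning `None` when the statistic is undefined is a design choice made here for illustration.

```python
import math


def rho_from_dw(d):
    """Approximate first-order serial correlation implied by DW = 2(1 - rho)."""
    return 1 - d / 2


def durbin_h(rho_hat, n, var_alpha):
    """Durbin's h; returns None when n*V(alpha_hat) >= 1 and the test is not defined."""
    denom = 1 - n * var_alpha
    if denom <= 0:
        return None
    return rho_hat * math.sqrt(n / denom)


rho_hat = rho_from_dw(1.8)                   # 0.1
h_014 = durbin_h(rho_hat, 50, 0.14 ** 2)     # about 5.0
h_013 = durbin_h(rho_hat, 50, 0.13 ** 2)     # about 1.80
h_015 = durbin_h(rho_hat, 50, 0.15 ** 2)     # None: test not defined
print(h_014, h_013, h_015)
```

The three calls correspond to the three standard errors of the coefficient of the lagged dependent variable considered above, and show how sensitive h is to that standard error.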


6.8 A General Test for Higher-Order Serial 
Correlation: The LM Test 


The h-test we have discussed is, like the Durbin-Watson test, a test for
first-order autoregression. Breusch²⁴ and Godfrey²⁵ discuss some general tests
that are easy to apply and are valid for very general hypotheses about the
serial correlation in the errors. These tests are derived from a general
principle called the Lagrange multiplier (LM) principle. A discussion of this
principle is beyond the scope of this book. For the present we will explain what
the test is. The test is similar to Durbin’s second test that we have discussed.

²³ Maddala and Rao, “Tests for Serial Correlation.”

²⁴ T. S. Breusch, “Testing for Autocorrelation in Dynamic Linear Models,”
Australian Economic Papers, Vol. 17, 1978, pp. 334-355.

²⁵ L. G. Godfrey, “Testing for Higher Order Serial Correlation in Regression
Equations When the Regressors Include Lagged Dependent Variables,” Econometrica,
Vol. 46, 1978, pp. 1303-1310.

Consider the regression model

$$y_t = \sum_{i=1}^{k} x_{it}\beta_i + u_t, \qquad t = 1, 2, \ldots, n \tag{6.14}$$

and

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_p u_{t-p} + e_t, \qquad e_t \sim IN(0, \sigma^2) \tag{6.15}$$

We are interested in testing $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_p = 0$. The
$x$’s in equation (6.14) include lagged dependent variables as well. The LM test
is as follows.

First, estimate (6.14) by OLS and obtain the least squares residuals
$\hat u_t$. Next, estimate the regression equation

$$\hat u_t = \sum_{i=1}^{k} x_{it}\gamma_i + \sum_{i=1}^{p}\hat u_{t-i}\rho_i + \eta_t \tag{6.16}$$

and test whether the coefficients of $\hat u_{t-i}$ are all zero. We take the
conventional F-statistic and use $p \cdot F$ as $\chi^2$ with degrees of freedom
$p$. We use the $\chi^2$-test rather than the F-test because the LM test is a
large sample test.

The test can be used for different specifications of the error process. For
instance, for the problem of testing for fourth-order autocorrelation

$$u_t = \rho_4 u_{t-4} + e_t \tag{6.17}$$

we just estimate

$$\hat u_t = \sum_{i=1}^{k} x_{it}\gamma_i + \rho_4\hat u_{t-4} + \eta_t \tag{6.18}$$

instead of (6.16) and test $\rho_4 = 0$.

The test procedure is the same for autoregressive or moving average errors.
For instance, if we have a moving average (MA) error

$$u_t = e_t + \rho_4 e_{t-4}$$

instead of (6.17), the test procedure is still to estimate (6.18) and test
$\rho_4 = 0$. Consider the following types of errors:

AR(2): $u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t$

MA(2): $u_t = e_t + \rho_1 e_{t-1} + \rho_2 e_{t-2}$

AR(2) with interaction: $u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} - \rho_1\rho_2 u_{t-3} + e_t$

In all these cases, we just test $H_0$ by estimating equation (6.16) with
$p = 2$ and test $\rho_1 = \rho_2 = 0$. What is of importance is the degree of
autoregression, not the nature.

Finally, an alternative to the estimation of (6.16) is to estimate the equation

$$y_t = \sum_{i=1}^{k} x_{it}\beta_i + \sum_{i=1}^{p}\hat u_{t-i}\rho_i + \eta_t \tag{6.19}$$

Thus the LM test for serial correlation is:

1. Estimate equation (6.14) by OLS and get the residual $\hat u_t$.

2. Estimate equation (6.16) or (6.19) by OLS and compute the F-statistic for
testing the hypothesis $H_0$: $\rho_1 = \rho_2 = \cdots = \rho_p = 0$.

3. Use $p \cdot F$ as $\chi^2$ with $p$ degrees of freedom.

6.9 Strategies When the DW Test 
Statistic is Significant 

The DW test is designed as a test for the hypothesis $\rho = 0$ if the errors
follow a first-order autoregressive process $u_t = \rho u_{t-1} + e_t$. However,
the test has been found to be robust against other alternatives such as AR(2),
MA(1), ARMA(1, 1), and so on. Further, and more disturbingly, it catches
specification errors like omitted variables that are themselves autocorrelated,
and misspecified dynamics (a term that we will explain). Thus the strategy to
adopt, if the DW test statistic is significant, is not clear. We discuss three
different strategies.

1. Assume that the significant DW statistic is an indication of serial
correlation but may not be due to AR(1) errors.

2. Test whether serial correlation is due to omitted variables.

3. Test whether serial correlation is due to misspecified dynamics.


Errors Not AR(1)

In case 1, if the DW statistic is significant, since it does not necessarily mean
that the errors are AR(1), we should check for higher-order autoregressions by
estimating equations of the form

$$\hat u_t = \rho_1\hat u_{t-1} + \rho_2\hat u_{t-2} + e_t$$

Once the order has been determined, we can estimate the model with the
appropriate assumptions about the error structure by the methods described in
Section 6.4. Actually, there are two ways of going about this problem of
determining the appropriate order of the autoregression. The first is to
progressively complicate the model by testing for higher-order autoregressions.
The second is to start with an autoregression of sufficiently high order and
progressively simplify it. Although the former approach is the one commonly
used, the latter approach is better from the theoretical point of view.

One other question that remains is that of moving average errors and ARMA
errors. Estimation with moving average errors and ARMA errors is more
complicated than with AR errors. Moreover, Hendry²⁶ and Hendry and Trivedi²⁷
argue that it is the order of the error process that is more important than the
particular form. Thus from the practical point of view, for most economic data,
it is sufficient to determine the order of the AR process. Thus if a significant
DW statistic is observed, the appropriate strategy would be to try to see
whether the errors are generated by a higher-order AR process than AR(1) and
then undertake estimation.

Autocorrelation Caused by Omitted Variables 

Case 2, serial correlation being caused by omitted variables, is rather difficult
to tackle. It is often asserted that the source of serial correlation in the
errors is that some variables that should have been included in the equation are
omitted and that these omitted variables are themselves autocorrelated. However,
if this is the argument for serial correlation, and it is an appealing one, one
should be careful in suggesting the methods that we have discussed until now.
Suppose that the true regression equation is

$$y_t = \beta_0 + \beta_1 x_t + \beta_2 x_t^2 + u_t$$

and instead we estimate

$$y_t = \beta_0 + \beta_1 x_t + v_t \tag{6.20}$$

Then since $v_t = \beta_2 x_t^2 + u_t$, if $x_t$ is autocorrelated, this will
produce autocorrelation in $v_t$. However, $v_t$ is no longer independent of
$x_t$. Thus not only are the OLS estimators of $\beta_0$ and $\beta_1$ from
(6.20) inefficient, they are inconsistent as well.

As yet another example, suppose that the true relation is

$$y_t = \beta_1 x_t + \beta_2 z_t + u_t \tag{6.21}$$

and we estimate

$$y_t = \beta_1 x_t + w_t \tag{6.22}$$

Again, if $z_t$ are autocorrelated, $w_t$ will also be. But if $z_t$ and $x_t$
are independent, the methods we have discussed earlier are applicable. Thus, to
justify the methods of estimation we have discussed, we have to argue that the
autocorrelated omitted variables that are producing the autocorrelation in the
errors are uncorrelated with the included explanatory variables. Further, if
there are any time trends in these omitted variables, they will produce not only
autocorrelated errors but also heteroskedastic errors.

²⁶ D. F. Hendry, “Comments on the Papers by Granger-Newbold and Sargent-Sims,”
in New Methods in Business Cycle Research (Minneapolis: Federal Reserve Bank of
Minneapolis, October 1977).

²⁷ D. F. Hendry and P. K. Trivedi, “Maximum Likelihood Estimation of Difference
Equations with Moving Average Errors: A Simulation Study,” The Review of
Economic Studies, Vol. 39, April 1972, pp. 117-145.

In equation (6.21) let us assume that the errors $u_t$ are independent with a
common variance $\sigma^2$. However, we estimate equation (6.22) and compute the
DW statistic $d$. What can we say about it? Note that since the least squares
residuals are always uncorrelated with the included variables (by virtue of the
normal equations), the DW statistic $d$ is determined not by the autocorrelation
in $z_t$ but by the autocorrelation in $z_t^*$, which is that part of $z_t$ left
unexplained by $x_t$.

Consider a regression of $z_t$ on $x_t$. Let the regression coefficient be
denoted by $b$. Then $z_t = bx_t + z_t^*$, where $z_t^*$ is the residual from a
regression of $z_t$ on $x_t$. Equation (6.21) can be written as

$$y_t = \beta_1 x_t + \beta_2(bx_t + z_t^*) + u_t = (\beta_1 + \beta_2 b)x_t + w_t \tag{6.23}$$

where $w_t = \beta_2 z_t^* + u_t$.

If we estimate (6.22) by OLS and $\hat\beta_1$ is the OLS estimator of
$\beta_1$, then $E(\hat\beta_1) = \beta_1 + \beta_2 b$ and the residual
$\hat w_t$ would be estimating $\beta_2 z_t^* + u_t$. Let
$\operatorname{var}(z_t^*) = \sigma_*^2$ and
$\operatorname{cov}(z_t^*, z_{t-1}^*) = \rho^*\sigma_*^2$. Then since
$\operatorname{cov}(z_t^*, u_t) = 0$ we have

$$\operatorname{cov}(\beta_2 z_t^* + u_t,\ \beta_2 z_{t-1}^* + u_{t-1}) = \beta_2^2\rho^*\sigma_*^2$$

and

$$\operatorname{var}(\beta_2 z_t^* + u_t) = \beta_2^2\sigma_*^2 + \sigma^2$$

The first-order serial correlation in $w_t$ would be²⁸

$$\rho_w = \frac{\rho^*}{1 + \sigma^2/(\beta_2^2\sigma_*^2)}$$

If $d$ is the DW statistic from OLS estimation of (6.22), then

$$\operatorname{plim} d = 2(1 - \rho_w)$$

Note that the observed serial correlation depends on the serial correlation in
$z_t^*$, not $z_t$ (the omitted variable), and the variance ratio
$\sigma^2/(\beta_2^2\sigma_*^2)$. If this variance ratio is large, then, even if
$\rho^*$ is high, $\rho_w$ can be small.
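A quick numerical illustration of this last point, using the expressions just derived. The function names and the parameter values are hypothetical and chosen only to show the effect of a large variance ratio.

```python
def implied_residual_autocorr(rho_star, sigma2_u, beta2, sigma2_star):
    """rho_w = rho* / (1 + sigma^2 / (beta_2^2 * sigma_*^2))."""
    variance_ratio = sigma2_u / (beta2 ** 2 * sigma2_star)
    return rho_star / (1 + variance_ratio)


def plim_dw(rho_w):
    """Probability limit of the DW statistic, 2(1 - rho_w)."""
    return 2 * (1 - rho_w)


# Strongly autocorrelated omitted component, but a variance ratio of 9
rho_w = implied_residual_autocorr(rho_star=0.9, sigma2_u=9.0, beta2=1.0, sigma2_star=1.0)
d_limit = plim_dw(rho_w)
print(rho_w, d_limit)  # rho_w is about 0.09, so d is close to 2
```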

One can test for omitted variables using the RESET test of Ramsey or 
White’s test outlined in Section 5.2. If the DW test statistic is significant but 
these tests also show significance, the appropriate strategy would be to esti¬ 
mate the model by some general procedure like the procedure described in 
Section 5.4 rather than use a transformation based on the estimated first-order 
autocorrelation. 


“This formula has been derived in a more general case in M. Chaudhuri, “Autocorrelated Dis¬ 
turbances in the Light of Specification Analysis,” Journal of Econometrics, Vol. 5, 1977, pp. 
301-313. 
Serial Correlation Due to Misspecified Dynamics

In a seminal paper published in 1964, Sargan 29 pointed out that a significant
DW statistic does not necessarily imply that we have a serial correlation
problem. This point was also emphasized by Hendry and Mizon. 30 The argument
goes as follows. Consider

$$y_t = \beta x_t + u_t \quad \text{with} \quad u_t = \rho u_{t-1} + e_t \tag{6.24}$$

where the $e_t$ are independent with a common variance $\sigma^2$. We can write this model as

$$y_t = \rho y_{t-1} + \beta x_t - \beta\rho x_{t-1} + e_t \tag{6.25}$$

Consider an alternative stable dynamic model:

$$y_t = \beta_1 y_{t-1} + \beta_2 x_t + \beta_3 x_{t-1} + e_t \qquad |\beta_1| < 1 \tag{6.26}$$

Equation (6.25) is the same as equation (6.26) with the restriction

$$\beta_1\beta_2 + \beta_3 = 0 \tag{6.27}$$

A test for $\rho = 0$ is a test for $\beta_1 = 0$ (and $\beta_3 = 0$). But before we test this, what
Sargan says is that we should first test the restriction (6.27) and test for $\rho = 0$
only if the hypothesis $H_0$: $\beta_1\beta_2 + \beta_3 = 0$ is not rejected. If this hypothesis is
rejected, we do not have a serial correlation model and the serial correlation in
the errors in (6.24) is due to “misspecified dynamics,” that is, the omission of
the variables $y_{t-1}$ and $x_{t-1}$ from the equation.

The restriction (6.27) is nonlinear in the $\beta$’s and hence one has to use the
Wald test or the LR or LM tests. If the DW test statistic is significant, a proper
approach is to test the restriction (6.27) to make sure that what we have is a
serial correlation model before we undertake any autoregressive transformation
of the variables. In fact, Sargan suggests starting with the general model
(6.26) and testing the restriction (6.27) first, before attempting any tests for
serial correlation. 31

In this case there is, in general, no exact t-test as in the case of linear
restrictions. What we do is linearize the restriction by a Taylor series expansion and
use what is known as a Wald test, which is an asymptotic test (or use the LR
or LM tests).


29 J. D. Sargan, “Wages and Prices in the United Kingdom: A Study in Econometric
Methodology,” in P. E. Hart, G. Mills, and J. K. Whitaker (eds.), Econometric Analysis for National
Economic Planning, Colston Papers 16 (London: Butterworth, 1964), pp. 25-54.

30 D. F. Hendry and G. E. Mizon, “Serial Correlation as a Convenient Simplification, Not a
Nuisance: A Comment on a Study of the Demand for Money by the Bank of England,” The
Economic Journal, Vol. 88, September 1978, pp. 549-563.

31 J. G. Thursby, “A Test Strategy for Discriminating Between Autocorrelation and
Misspecification in Regression Analysis,” The Review of Economics and Statistics, Vol. 63, 1981, pp.
117-123, considers the use of the DW, Ramsey’s RESET, and Sargan’s test together. This is
useful for warning against autoregressive transformations based on the DW statistic but does
not tell what the estimation strategy should be after the tests.
The Wald Test

Define

$$f(\beta) = \beta_1\beta_2 + \beta_3$$

Using a first-order Taylor series expansion, we get

$$f(\hat\beta) = f(\beta) + \sum_{i=1}^{3} g_i(\hat\beta_i - \beta_i)$$

where $g_i = \partial f/\partial\beta_i$. Under the null hypothesis $f(\beta) = 0$ and

$$\operatorname{var}[f(\hat\beta)] = \sigma^2 \sum_{i}\sum_{j} g_i g_j C_{ij} = \psi^2 \quad \text{(say)}$$

since

$$\operatorname{cov}(\hat\beta_i, \hat\beta_j) = \sigma^2 C_{ij}$$

The Wald test statistic is obtained by substituting $\hat g_i$ for $g_i$ and $\hat\sigma^2$ for $\sigma^2$ in $\psi^2$.
Denoting the resulting expression by $\hat\psi^2$, we get the statistic

$$W = \frac{[f(\hat\beta)]^2}{\hat\psi^2}$$

which has (asymptotically) a $\chi^2$-distribution with 1 degree of freedom.

In the particular case we are considering, note that $g_1 = \beta_2$, $g_2 = \beta_1$, and
$g_3 = 1$.

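However, there are some problems with the Wald test. The restriction (6.27)
can as well be written as

$$f(\beta) = \beta_1 + \frac{\beta_3}{\beta_2} = 0 \tag{6.28}$$

or

$$f(\beta) = \beta_2 + \frac{\beta_3}{\beta_1} = 0 \tag{6.28'}$$

If we write it as (6.28), we have

$$g_1 = 1 \qquad g_2 = -\frac{\beta_3}{\beta_2^2} \qquad g_3 = \frac{1}{\beta_2}$$

and if we write it as (6.28') we have

$$g_1 = -\frac{\beta_3}{\beta_1^2} \qquad g_2 = 1 \qquad g_3 = \frac{1}{\beta_1}$$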

Although, asymptotically, it should not matter how the Wald test is
constructed, in practice it has been found that the results differ depending on how
we formulate the restrictions. 32 However, formulations (6.28) and (6.28')
implicitly assume that $\beta_2 \neq 0$ or $\beta_1 \neq 0$, respectively, and thus in this case it is
more meaningful to use the restriction in the form (6.27) rather than (6.28) or
(6.28').
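The dependence of the Wald statistic on how the restriction is written can be sketched numerically. The following is a minimal illustration, not taken from the text: the data are simulated from the serial correlation model (6.24) with made-up values of $\beta$ and $\rho$, the regressor is an arbitrary AR(1) series, and the helper name `wald` is hypothetical. The unrestricted model (6.26) is fitted by OLS and the statistic is then computed for the forms (6.27), (6.28), and (6.28') using the gradients given above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the serial correlation model (6.24); beta and rho are illustrative.
n, beta, rho = 2000, 1.0, 0.6
x = np.zeros(n)
u = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()   # assumed AR(1) regressor
    u[t] = rho * u[t - 1] + rng.normal()   # AR(1) error
y = beta * x + u

# Unrestricted dynamic model (6.26): y_t on y_{t-1}, x_t, x_{t-1}.
Z = np.column_stack([y[:-1], x[1:], x[:-1]])
yy = y[1:]
ZtZ_inv = np.linalg.inv(Z.T @ Z)
b = ZtZ_inv @ Z.T @ yy
resid = yy - Z @ b
sigma2_hat = resid @ resid / (len(yy) - 3)
V = sigma2_hat * ZtZ_inv                   # estimate of sigma^2 * C_ij


def wald(f_value, grad, V):
    """Wald statistic f^2 / (g' V g) for a single restriction f(beta) = 0."""
    g = np.asarray(grad, dtype=float)
    return f_value ** 2 / (g @ V @ g)


b1, b2, b3 = b
W_627 = wald(b1 * b2 + b3, [b2, b1, 1.0], V)
W_628 = wald(b1 + b3 / b2, [1.0, -b3 / b2 ** 2, 1.0 / b2], V)
W_628p = wald(b2 + b3 / b1, [-b3 / b1 ** 2, 1.0, 1.0 / b1], V)

print(f"b1*b2 + b3 = {b1 * b2 + b3:.4f}")
print(f"W(6.27) = {W_627:.4f}  W(6.28) = {W_628:.4f}  W(6.28') = {W_628p:.4f}")
```

Because the simulated data satisfy the common factor restriction, the three values should be close to one another here; the gradients of the three forms coincide (up to scale) only when the restriction holds exactly in the estimates, so the differences tend to grow as the estimates move away from it.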

Note that a hypothesis like $\beta_1/\beta_2 = C$ can be transformed into a linear
hypothesis $\beta_1 - C\beta_2 = 0$. Similarly, $\beta_1/(1 - \beta_2) = C$ can be transformed to
$\beta_1 + C\beta_2 - C = 0$. On the other hand, if, for some reason, an exact confidence
interval was also needed for $\beta_1/\beta_2$, we can use Fieller’s method described in
Section 3.10. Noting the relationship between confidence intervals and tests of
hypotheses, one can construct a test for the hypothesis $\beta_1/\beta_2 = C$.

Illustrative Example 

Consider the data in Table 3.11 and the estimation of the production function
(4.24). In Section 6.4 we presented estimates of the equation assuming that the
errors are AR(1). This was based on a DW test statistic of 0.86. Suppose that
we estimate an equation of the form (6.26). The results are as follows (all
variables in logs; figures in parentheses are standard errors):

$$X_t = \underset{(0.530)}{-2.254} + \underset{(0.139)}{0.884}\,L_t + \underset{(0.152)}{0.710}\,K_t + \underset{(0.120)}{0.489}\,X_{t-1} - \underset{(0.252)}{0.073}\,L_{t-1} - \underset{(0.150)}{0.541}\,K_{t-1} \qquad \text{RSS}_0 = 0.01718$$

Under the assumption that the errors are AR(1), the residual sum of squares
obtained from the Hildreth-Lu procedure we used in Section 6.4 is $\text{RSS}_1 = 0.02635$.

Since we have two slope coefficients, we have two restrictions of the form
(6.27). Note that for the general dynamic model we are estimating six
parameters ($\alpha$ and five $\beta$’s). For the serial correlation model we are estimating four
parameters ($\alpha$, two $\beta$’s, and $\rho$). We will use the likelihood ratio test (LR), which
is based on (see the appendix to Chapter 3)

$$\lambda = \left(\frac{\text{RSS}_0}{\text{RSS}_1}\right)^{n/2}$$

and $-2\log_e\lambda$ has a $\chi^2$-distribution with d.f. 2 (number of restrictions). In our
example

$$-2\log_e\lambda = -39\log_e\left(\frac{\text{RSS}_0}{\text{RSS}_1}\right) = 16.7$$

which is significant at the 1% level. Thus the hypothesis of a first-order
autocorrelation is rejected. Although the DW statistic is significant, this does not
mean that the errors are AR(1).
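The arithmetic of the likelihood ratio test above can be checked in a few lines. This is only a sketch of the calculation: the two residual sums of squares and the multiplier 39 are the figures quoted in the example, while the 1% critical value of 9.21 for a $\chi^2$ with 2 d.f. is taken from standard tables rather than from the text.

```python
import math

rss_unrestricted = 0.01718   # RSS_0, general dynamic model
rss_restricted = 0.02635     # RSS_1, AR(1) model from the Hildreth-Lu procedure
n = 39                       # multiplier used in the text's calculation

# -2 log(lambda) with lambda = (RSS_0 / RSS_1)^(n/2)
lr = -n * math.log(rss_unrestricted / rss_restricted)

chi2_crit_1pct_2df = 9.21    # standard chi-square table value (assumption)
print(f"LR = {lr:.1f}, reject common factor restrictions: {lr > chi2_crit_1pct_2df}")
```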

32 A. W. Gregory and M. R. Veall, “On Formulating Wald Tests of Non-linear Restrictions,”
Econometrica, November 1985. The authors confirm by Monte Carlo studies and an empirical
example that these differences can be substantial. See also A. W. Gregory and M. R. Veall,
“Wald Tests of Common Factor Restrictions,” Economics Letters, Vol. 22, 1986, pp. 203-208.
*6.10 Trends and Random Walks


Throughout our discussion we have assumed that $E(u_t) = 0$ and $\operatorname{var}(u_t) = \sigma^2$
for all $t$, and $\operatorname{cov}(u_t, u_{t-k}) = \sigma^2\rho_k$ for all $t$ and $k$, where $\rho_k$ is the serial correlation
of lag $k$ (this is simply a function of the lag $k$ and does not depend on $t$). If these
assumptions are satisfied, the series $u_t$ is called covariance stationary
(covariances are constant over time) or just stationary. Many economic time series
are clearly nonstationary in the sense that the mean and variance depend on
time, and they tend to depart ever further from any given value as time goes
on. If this movement is predominantly in one direction (up or down) we say
that the series exhibits a trend. More detailed discussion of the topics covered
briefly here can be found in Chapter 14.

Nonstationary time series are frequently de-trended before further analysis 
is done. There are two procedures used for de-trending. 

1. Estimating regressions on time. 

2. Successive differencing. 

In the regression approach it is assumed that the series $y_t$ is generated by the
mechanism

$$y_t = f(t) + u_t$$

where $f(t)$ is the trend and $u_t$ is a stationary series with mean zero and variance
$\sigma_u^2$. Let us suppose that $f(t)$ is linear so that we have

$$y_t = \alpha + \beta t + u_t \tag{6.29}$$

Note that the trend-eliminated series is $\hat u_t$, the least squares residuals that
satisfy the relationships $\sum \hat u_t = 0$ and $\sum t\hat u_t = 0$. If differencing is used to
eliminate the trend we get $\Delta y_t = y_t - y_{t-1} = \beta + u_t - u_{t-1}$. We have to take a first
difference again to eliminate $\beta$ and we get $\Delta^2 y_t = \Delta^2 u_t = u_t - 2u_{t-1} + u_{t-2}$ as
the de-trended series.

On the other hand, suppose we assume that $y_t$ is generated by the model

$$y_t - y_{t-1} = \beta + e_t \tag{6.30}$$

where $e_t$ is a stationary series with mean zero and variance $\sigma^2$. In this case the
first difference of $y_t$ is stationary with mean $\beta$. This model is also known as the
random-walk model. Accumulating $y_t$ starting with an initial value $y_0$ we get
from (6.30)

$$y_t = y_0 + \beta t + \sum_{j=1}^{t} e_j \tag{6.31}$$
which has the same form as (6.29) except for the fact that the disturbance is
not stationary: it has variance $t\sigma^2$ that increases over time. Nelson and Plosser 33
call the model (6.29) trend-stationary processes (TSP) and model (6.30)
difference-stationary processes (DSP). Both the models exhibit a linear trend. But
the appropriate method of eliminating the trend differs. To test the hypothesis
that a time series belongs to the TSP class against the alternative that it belongs
to the DSP class, Nelson and Plosser use a test developed by Dickey and
Fuller. 34 This consists of estimating the model

$$y_t = \alpha + \rho y_{t-1} + \beta t + e_t \tag{6.32}$$

which belongs to the DSP class if $\rho = 1$, $\beta = 0$ and the TSP class if $|\rho| < 1$.
Thus we have to test the hypothesis $\rho = 1$, $\beta = 0$ against $|\rho| < 1$. The problem
here is that we cannot use the usual least squares distribution theory when
$\rho = 1$. Dickey and Fuller show that the least squares estimate of $\rho$ is not
distributed around unity under the DSP hypothesis (that is, the true value $\rho = 1$)
but rather around a value less than one. However, the negative bias diminishes
as the number of observations increases. They tabulate the significance points
for testing the hypothesis $\rho = 1$ against $|\rho| < 1$. Nelson and Plosser applied the
Dickey-Fuller test to a wide range of historical time series for the U.S.
economy and found that the DSP hypothesis was accepted in all cases, with the
exception of the unemployment rate. They conclude that for most economic
time series the DSP model is more appropriate, and that the TSP model would
be the relevant one only if we assume that the errors $u_t$ in (6.29) are highly
autocorrelated.
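The downward bias of the least squares estimate of $\rho$ under the DSP hypothesis can be seen in a small simulation. The sketch below is illustrative only: the sample size, number of replications, and seed are arbitrary choices, and the data are driftless random walks with standard normal increments. Each replication fits (6.32) by OLS and records the coefficient of $y_{t-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, reps = 100, 500                      # illustrative sample size and replications
trend = np.arange(1, T + 1)

rho_hats = np.empty(reps)
for r in range(reps):
    y = np.cumsum(rng.normal(size=T + 1))          # random walk: true rho = 1, beta = 0
    X = np.column_stack([np.ones(T), y[:-1], trend])
    coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    rho_hats[r] = coef[1]                          # OLS estimate of rho in (6.32)

print(f"mean of rho_hat over {reps} replications: {rho_hats.mean():.3f}")
```

The average estimate falls visibly below one even though the true value is one, which is why the usual t- and F-tables cannot be used for this hypothesis.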

The problem of testing the hypothesis $\rho = 1$ in the first-order autoregressive
equation of the form

$$y_t = \alpha + \rho y_{t-1} + u_t$$

is called “testing for unit roots.” There is an enormous literature on this
problem but one of the most commonly used tests is the Dickey-Fuller test. The
standard expression for the large sample variance of the least squares estimator
$\hat\rho$ is $(1 - \rho^2)/T$, which would be zero under the null hypothesis. Hence, one
needs to derive the limiting distribution of $\hat\rho$ under $H_0$: $\rho = 1$ to apply the test.

For testing the hypothesis $\rho = 1$, $\beta = 0$ in (6.32) Dickey and Fuller 35 suggest
an LR test, derive the limiting distribution and present tables for the test. The
F-values are much higher than those in the usual F-tables. For instance, the
5% significance values from the tables presented in Dickey and Fuller, and
33 C. R. Nelson and C. I. Plosser, “Trends and Random Walks in Macroeconomic Time Series:
Some Evidence and Implications,” Journal of Monetary Economics, Vol. 10, 1982, pp. 139-
162.

34 D. A. Dickey and W. A. Fuller, “Distribution of the Estimators for Autoregressive Time-
Series with a Unit Root,” Journal of the American Statistical Association, Vol. 74, 1979, pp.
427-431.

35 D. A. Dickey and W. A. Fuller, “Likelihood Ratio Statistics for Autoregressive Time Series
with a Unit Root,” Econometrica, Vol. 49, No. 4, 1981, pp. 1057-1072. See tables on p. 1063.
the corresponding F-values from the standard F-tables (when the numerator
d.f. is 2 as in this test) are as follows:


Sample       F-Ratio from      F-Ratio from
Size n       Dickey-Fuller     Standard F-tables (a)

25           7.24              3.42
50           6.73              3.20
100          6.49              3.10
∞            6.25              3.00

(a) d.f. for denominator = n - 3.


As an illustration consider the example given by Dickey and Fuller. 36 For the
logarithm of the quarterly Federal Reserve Board Production Index 1950-1
through 1977-4 they assume that the time series is adequately represented by
the model:

$$y_t = \beta_0 + \beta_1 t + \alpha_1 y_{t-1} + \alpha_2(y_{t-1} - y_{t-2}) + e_t \tag{6.33}$$

where $e_t$ are IN(0, $\sigma^2$) random variables. The ordinary least squares estimates
are:

$$y_t - y_{t-1} = \underset{(0.15)}{0.52} + \underset{(0.00034)}{0.00120}\,t - \underset{(0.033)}{0.119}\,y_{t-1} + \underset{(0.081)}{0.498}\,(y_{t-1} - y_{t-2}) \qquad \text{RSS} = 0.056448$$

$$y_t - y_{t-1} = \underset{(0.0025)}{0.0054} + \underset{(0.083)}{0.447}\,(y_{t-1} - y_{t-2}) \qquad \text{RSS} = 0.063211$$

where RSS denotes the residual sum of squares and the numbers in parentheses
are the “standard errors” as output from the usual regression program. (With
$y_t - y_{t-1}$ as the dependent variable, the coefficient of $y_{t-1}$ in the first equation
is an estimate of $\alpha_1 - 1$.)

The F-test for the hypothesis $\beta_1 = 0$, $\alpha_1 = 1$ is

$$F = \frac{(0.063211 - 0.056448)/2}{0.056448/106} = 6.34$$

If we use the standard F-tables this F-ratio is significant at even the 1%
significance level. But the F-ratio tabulated by Dickey and Fuller is 6.49 for the 5%
significance level. Thus, the hypothesis that the second-order autoregressive
process (6.33) has a unit root is accepted at the 5% significance level, though it
is rejected at the 10% significance level.
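The F-ratio in this illustration is a routine restricted-versus-unrestricted calculation, and only the table used for comparison changes. A minimal sketch of the arithmetic, using the two residual sums of squares and the 106 denominator d.f. quoted above together with the tabulated 5% values for n = 100 (6.49 from Dickey and Fuller, 3.10 from the standard F-tables):

```python
rss_restricted = 0.063211     # model with beta_1 = 0, alpha_1 = 1 imposed
rss_unrestricted = 0.056448   # model (6.33)
num_restrictions = 2
denominator_df = 106

F = ((rss_restricted - rss_unrestricted) / num_restrictions) / (
    rss_unrestricted / denominator_df
)

dickey_fuller_5pct = 6.49     # tabulated value for n = 100
standard_f_5pct = 3.10        # standard F-table value for n = 100

print(f"F = {F:.3f}")
print("exceeds standard 5% value:", F > standard_f_5pct)
print("exceeds Dickey-Fuller 5% value:", F > dickey_fuller_5pct)
```

The computed ratio is just under 6.35 (reported as 6.34 in the text); it exceeds the standard 5% value but not the Dickey-Fuller one, which is the whole point of the comparison.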


Spurious Trends 

If $\beta = 0$ in equation (6.30) the model is called a trendless random walk or a
random walk with zero drift. However, from equation (6.31) note that even
though there is no trend in the mean, there is a trend in the variance. Suppose


36 Dickey and Fuller, “Likelihood Ratio Statistics,” pp. 1070-1071.
that the true model is of the DSP type with $\beta = 0$. What happens if we estimate
a TSP type model? That is, the true model is one with no trend in the mean but
only a trend in the variance, and we estimate a model with a trend in the mean
but no trend in the variance. It is intuitively clear that the trend in the variance
will be transmitted to the mean and we will find a significant coefficient for $t$
even though in reality there is no trend in the mean. How serious is this
problem? Nelson and Kang 37 analyze this. They conclude that:

1. Regression of a random walk on time by least squares will produce $R^2$
values of around 0.44 regardless of sample size when, in fact, the mean
of the variable has no relationship with time whatsoever.

2. In the case of random walks with drift, that is $\beta \neq 0$, the $R^2$ will be higher
and will increase with the sample size, reaching one in the limit regardless
of the value of $\beta$.

3. The residual from the regression on time, which we take as the de-trended
series, has on the average only about 14% of the true stochastic variance
of the original series.

4. The residuals from the regression on time are also autocorrelated, being
roughly $(1 - 10/N)$ at lag one, where $N$ is the sample size.

5. Conventional t-tests to test the significance of some of the regressors are
not valid. They tend to reject the null hypothesis of no dependence on
time, with very high frequency.

6. Regression of one random walk on another, with time included for trend,
is strongly subject to the spurious regression phenomenon. That is, the
conventional t-test will tend to indicate a relationship between the
variables when none is present.
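The first of these conclusions is easy to reproduce by simulation. The following is a minimal sketch under assumed settings (driftless random walks with standard normal increments, sample size 100, 1000 replications, arbitrary seed); it regresses each simulated random walk on a constant and time and averages the resulting $R^2$ values.

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps = 100, 1000                         # illustrative sample size and replications
X = np.column_stack([np.ones(N), np.arange(1, N + 1)])

r_squared = np.empty(reps)
for r in range(reps):
    y = np.cumsum(rng.normal(size=N))       # random walk with zero drift
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r_squared[r] = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(f"average R^2 from regressing a random walk on time: {r_squared.mean():.2f}")
```

Changing `N` should leave the average roughly unchanged, in line with the statement that the result does not depend on the sample size.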

The main conclusion is that using a regression on time has serious
consequences when, in fact, the time series is of the DSP type and, hence,
differencing is the appropriate procedure for trend elimination. Plosser and Schwert 38
also argue that with most economic time series it is always best to work with
differenced data rather than data in levels. The reason is that if indeed the data
series are of the DSP type, the errors in the levels equation will have variances
increasing over time. Under these circumstances many of the properties of
least squares estimators as well as tests of significance are invalid. On the other
hand, suppose that the levels equation is correctly specified. Then all
differencing will do is produce a moving average error and at worst ignoring it will
give inefficient estimates. For instance, suppose that we have the model

$$y_t = \alpha + \beta x_t + \gamma t + u_t$$


37 C. R. Nelson and H. Kang, “Pitfalls in the Use of Time as an Explanatory Variable in
Regression,” Journal of Business and Economic Statistics, Vol. 2, January 1984, pp. 73-82.

38 C. I. Plosser and G. W. Schwert, “Money, Income and Sunspots: Measuring Economic
Relationships and the Effects of Differencing,” Journal of Monetary Economics, Vol. 4, 1978, pp.
637-660.
where $u_t$ are independent with mean zero and common variance $\sigma^2$. If we
difference this equation, we get

$$\Delta y_t = \beta\,\Delta x_t + \gamma + v_t$$

where the error $v_t = \Delta u_t = u_t - u_{t-1}$ is a moving average, and, hence, not
serially independent. But estimating the first difference equation by least
squares still gives us consistent estimates. Thus, the consequences of
differencing when it is not needed are much less serious than those of failing to
difference when it is appropriate (when the true model is of the DSP type).
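The moving average structure of the differenced error can be verified directly. For independent $u_t$ with common variance $\sigma^2$, $v_t = u_t - u_{t-1}$ has variance $2\sigma^2$ and first-order autocovariance $-\sigma^2$, so its lag-one autocorrelation is $-0.5$ and all higher-order autocorrelations are zero. A short simulated check (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.normal(size=100_000)    # independent errors in the levels equation
v = np.diff(u)                  # error after differencing: v_t = u_t - u_{t-1}

lag1 = np.corrcoef(v[1:], v[:-1])[0, 1]
lag2 = np.corrcoef(v[2:], v[:-2])[0, 1]

print(f"lag-1 autocorrelation of v: {lag1:.3f}")
print(f"lag-2 autocorrelation of v: {lag2:.3f}")
```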

In practice, it is best to use the Dickey-Fuller test to check whether the data
are of DSP or TSP type. Otherwise, it is better to use differencing and
regressions in first differences, rather than regressions in levels with time as an extra
explanatory variable.

Differencing and Long-Run Effects: The Concept of Cointegration

One drawback of the procedure of differencing is that it results in a loss of
valuable “long-run information” in the data. Recently, the concept of
cointegrated series has been suggested as one solution to this problem. 39 First, we
need to define the term “cointegration.” Although we do not need the
assumption of normality and independence, we will define the terms under this
assumption.

If $e_t$ are IN(0, $\sigma^2$) we say $e_t$ are I(0), that is, integrated of order zero. (More
generally, $e_t$ is a stationary process. This is discussed in Chapter 14.)

If $y_t$ follows a random walk model, that is,

$$y_t = y_{t-1} + e_t$$

then we get by successive substitution,

$$y_t = \sum_{j=0}^{t-1} e_{t-j} \qquad \text{if } y_0 = 0$$

Thus, $y_t$ is a summation of the $e_j$, and

$$\Delta y_t = e_t$$

which is I(0). We say in this case that $y_t$ is I(1) [integrated of order one]. If $y_t$ is
I(1) and we add to this $z_t$ which is I(0), then $y_t + z_t$ will be I(1). When we specify
regression models in time series, we have to make sure that the different
variables are integrated to the same degree. Otherwise, the equation does not make
sense. For instance, if we specify the regression model:

$$y_t = \beta x_t + u_t \tag{6.34}$$


39 A reference which is, however, quite technical for our purpose, is R. F. Engle and C. W. J.
Granger, “Co-Integration and Error Correction: Representation, Estimation and Testing,”
Econometrica, Vol. 55, March 1987, pp. 251-276.
and we say that $u_t \sim$ IN(0, $\sigma^2$), that is, $u_t$ is I(0), we have to make sure that $y_t$
and $x_t$ are integrated to the same order. For instance, if $y_t$ is I(1) and $x_t$ is I(0)
there will not be any $\beta$ that will satisfy the relationship (6.34). Suppose $y_t$ is
I(1) and $x_t$ is I(1); then if there is a nonzero $\beta$ such that $y_t - \beta x_t$ is I(0) then $y_t$
and $x_t$ are said to be cointegrated.

Suppose that $y_t$ and $x_t$ are both random walks, so that they are both I(1).
Then an equation in first differences of the form

$$\Delta y_t = \alpha\,\Delta x_t + \lambda(y_t - \beta x_t) + v_t \tag{6.35}$$

is a valid equation, since $\Delta y_t$, $\Delta x_t$, $(y_t - \beta x_t)$ and $v_t$ are all I(0). Equation (6.34)
is considered a long-run relationship between $y_t$ and $x_t$ and equation (6.35)
describes short-run dynamics. Engle and Granger suggest estimating (6.34) by
ordinary least squares, obtaining the estimator $\hat\beta$ of $\beta$ and substituting it in
equation (6.35) to estimate the parameters $\alpha$ and $\lambda$. This two-step estimation
procedure, however, rests on the assumption that $y_t$ and $x_t$ are cointegrated. It
is, therefore, important to test for cointegration. Engle and Granger suggest
estimating (6.34) by ordinary least squares, getting the residual $\hat u_t$ and then
applying the Dickey-Fuller test (or some other test 40) based on $\hat u_t$. What this
test amounts to is testing the hypothesis $\rho = 1$ in

$$\hat u_t = \rho\,\hat u_{t-1} + e_t$$

that is, testing the hypothesis

$$H_0: \hat u_t \text{ is I(1)}$$

In essence, we are testing the null hypothesis that $y_t$ and $x_t$ are not
cointegrated. Note that $y_t$ is I(1) and $x_t$ is I(1), so we are trying to see that $u_t$ is not
I(1).
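The residual-based procedure just described can be sketched on simulated data. This is a rough illustration under stated assumptions: the series are generated with a made-up $\beta = 2$, the helper name `first_step_and_residual_rho` is hypothetical, and only the first-step regression (6.34) and the autoregression of the residuals are computed. No significance test is carried out, because the appropriate critical values come from adjusted Dickey-Fuller tables that are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta = 500, 2.0                      # illustrative sample size and long-run parameter


def first_step_and_residual_rho(y, x):
    """OLS of y on x as in (6.34), then OLS of the residual on its own lag."""
    beta_hat = (x @ y) / (x @ x)
    u_hat = y - beta_hat * x
    rho_hat = (u_hat[:-1] @ u_hat[1:]) / (u_hat[:-1] @ u_hat[:-1])
    return beta_hat, rho_hat


x = np.cumsum(rng.normal(size=n))               # x_t is I(1)
y_coint = beta * x + rng.normal(size=n)         # y_t - beta * x_t is I(0)
y_indep = np.cumsum(rng.normal(size=n))         # unrelated random walk

beta_hat_c, rho_hat_c = first_step_and_residual_rho(y_coint, x)
beta_hat_nc, rho_hat_nc = first_step_and_residual_rho(y_indep, x)

print(f"cointegrated pair:     beta_hat = {beta_hat_c:.3f}, residual rho_hat = {rho_hat_c:.3f}")
print(f"non-cointegrated pair: beta_hat = {beta_hat_nc:.3f}, residual rho_hat = {rho_hat_nc:.3f}")
```

For the cointegrated pair the residual autoregression coefficient is far from one, while for the unrelated random walks it stays close to one, which is the pattern the test is designed to detect.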

As shown by Bewley and also by Wickens and Breusch, 41 the two-step
estimation procedure suggested by Engle and Granger of first estimating the long-run
parameter $\beta$ and then estimating the short-run parameters $\alpha$ and $\lambda$ in
equation (6.35) is unnecessary. They argue that one should estimate both the long-run
and short-run parameters simultaneously and one would get more efficient
estimates of the long-run parameter $\beta$ by this procedure. Dividing (6.35) by $\lambda$
and rearranging we get

$$y_t = \beta x_t + \frac{1}{\lambda}\,\Delta y_t - \frac{\alpha}{\lambda}\,\Delta x_t - \frac{1}{\lambda}\,v_t \tag{6.36}$$


40 Since $\hat u_t$ are estimated residuals, the Dickey-Fuller tables have to be adjusted. An alternative
test that is also often suggested for testing unit roots is the Sargan-Bhargava test. See J. D.
Sargan and A. Bhargava, “Testing Residuals from Least Squares Regression for Being
Generated by the Gaussian Random Walk,” Econometrica, Vol. 51, 1983, pp. 153-174. This is a
Durbin-Watson type test with significance levels corrected.

41 R. A. Bewley, “The Direct Estimation of the Equilibrium Response in a Linear Dynamic
Model,” Economics Letters, Vol. 3, 1979, pp. 357-361. M. R. Wickens and T. S. Breusch,
“Dynamic Specification, the Long Run and the Estimation of Transformed Regression
Models,” Economic Journal, 1988 (supplement), pp. 189-205.
Since $\Delta y_t$ will be correlated with the error $v_t$, equation (6.36) has to be
estimated by the instrumental variable method. The coefficients of $\Delta y_t$ and $\Delta x_t$
describe the short-run dynamics. Note that if $y_t$ and $x_t$ are I(1), then $\Delta y_t$ and
$\Delta x_t$ are, like $u_t$, I(0), that is, they are of a lower order. Wickens and Breusch
show that misspecification of the short-run dynamics does not have much of
an effect on the estimation of the long-run parameters. For instance, in
equation (6.36) even if $\Delta x_t$ is omitted, the estimate of the parameter $\beta$ will still be
consistent. The intuitive reasoning behind this is that $\Delta y_t$ and $\Delta x_t$ are of a lower
order than $y_t$ and $x_t$.

*6.11 ARCH Models and Serial Correlation 


We saw in Section 6.9 that a significant DW statistic can arise through a number
of misspecifications. We will now discuss one other source. This is the ARCH
model suggested by Engle 42 which has, in recent years, been found useful in
the analysis of speculative prices. ARCH stands for “autoregressive
conditional heteroskedasticity.”

When we write the simple autoregressive model

$$y_t = \lambda y_{t-1} + e_t \qquad e_t \sim \text{IN}(0, \sigma^2)$$

we are saying that the conditional mean $E(y_t \mid y_{t-1}) = \lambda y_{t-1}$ depends on $t$ but the
conditional variance $\operatorname{var}(y_t \mid y_{t-1}) = \sigma^2$ is a constant. The unconditional mean of
$y_t$ is zero and the unconditional variance is $\sigma^2/(1 - \lambda^2)$.

The ARCH model is a generalization of this, in that the conditional variance
is also made a function of the past. If the conditional density $f(y_t \mid z_{t-1})$ is normal,
a general expression for the ARCH model is

$$y_t \mid z_{t-1} \sim N\big(g(z_{t-1}),\, h(z_{t-1})\big)$$

To make this operational, Engle specifies the conditional mean $g(z_{t-1})$ as a
linear function of the variables $z_{t-1}$ and $h$ as

$$h_t = \alpha_0 + \alpha_1 e_{t-1}^2 + \alpha_2 e_{t-2}^2 + \cdots + \alpha_p e_{t-p}^2$$

where $e_t = y_t - g_t$. In the simplest case we can consider the model

$$y_t = \lambda y_{t-1} + \beta x_t + e_t \qquad e_t \sim \text{IN}(0, h_t) \tag{6.37}$$

$$h_t = \operatorname{var}(e_t) = \alpha_0 + \alpha_1 e_{t-1}^2 \tag{6.38}$$

Note that $e_t$ are not autocorrelated. But the fact that the variance of $e_t$
depends on $e_{t-1}^2$ gives a misleading impression of there being a serial correlation.
For instance, suppose that in (6.37) $\lambda = 0$, that is, we do not have $y_{t-1}$ as an
explanatory variable. If we estimate that equation by OLS we will find a
significant DW statistic because of the ARCH effect given by (6.38). The situation

42 R. F. Engle, “Autoregressive Conditional Heteroscedasticity with Estimates of the Variance
of U.K. Inflation,” Econometrica, Vol. 50, 1982, pp. 987-1007.
is similar to the one we discussed in Section 6.10 where a trend in the variance
was transmitted to a trend in the mean. In this case a simple test for the ARCH
effect, that is, a test for the hypothesis $\alpha_1 = 0$ in (6.38), is to get the OLS
residuals $\hat e_t$, regress $\hat e_t^2$ on $\hat e_{t-1}^2$ (with a constant term), and test whether
the coefficient of $\hat e_{t-1}^2$ is zero. An LM (Lagrangian multiplier) test is to use $nR^2$
as $\chi^2$ with one d.f. This would enable us to see whether the significant DW test
statistic is due to serial correlation in $e_t$ or due to the ARCH effect. Many
empirical studies have found significant ARCH effects.
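The $nR^2$ form of the test is simple to compute. The sketch below is illustrative only: the errors are simulated from (6.38) with made-up values $\alpha_0 = 1$ and $\alpha_1 = 0.5$, they are used directly in place of OLS residuals (as if the regression had no explanatory variables), the function name `arch_lm_statistic` is hypothetical, and the 5% critical value 3.84 for a $\chi^2$ with one d.f. is taken from standard tables.

```python
import numpy as np

rng = np.random.default_rng(5)
n, a0, a1 = 2000, 1.0, 0.5            # illustrative ARCH(1) parameters

# Simulate errors whose conditional variance follows (6.38).
e = np.zeros(n)
for t in range(1, n):
    h_t = a0 + a1 * e[t - 1] ** 2
    e[t] = np.sqrt(h_t) * rng.normal()


def arch_lm_statistic(resid):
    """n * R^2 from regressing squared residuals on their first lag and a constant."""
    e2 = resid ** 2
    y = e2[1:]
    X = np.column_stack([np.ones(len(y)), e2[:-1]])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return len(y) * r2


stat_arch = arch_lm_statistic(e)
stat_iid = arch_lm_statistic(rng.normal(size=n))   # homoskedastic comparison series

chi2_crit_5pct_1df = 3.84             # standard chi-square table value (assumption)
print(f"nR^2 with ARCH errors: {stat_arch:.1f}")
print(f"nR^2 with i.i.d. errors: {stat_iid:.2f}")
```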

The estimation of the ARCH model can be carried out by an iterative
procedure. First, we estimate (6.37). We then get estimates of $\alpha_0$ and $\alpha_1$ in (6.38)
by regressing $\hat e_t^2$ on $\hat e_{t-1}^2$. Now we estimate (6.37) as a heteroskedastic
regression model, since we have an estimate of $h_t$. This process can be repeated until
convergence.
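The iteration can be sketched as follows. This is a simplified illustration rather than Engle's estimator: the data are simulated from (6.37) with $\lambda = 0$ and made-up parameter values, the variance equation is fitted by an ordinary regression of squared residuals on their lag, and the floor placed on the fitted variances is an ad hoc guard added here to keep the weights positive.

```python
import numpy as np

rng = np.random.default_rng(6)
n, beta, a0, a1 = 3000, 2.0, 1.0, 0.3     # illustrative values; lambda = 0 in (6.37)

x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = np.sqrt(a0 + a1 * e[t - 1] ** 2) * rng.normal()
y = beta * x + e

b = (x @ y) / (x @ x)                     # step 1: OLS estimate of beta
for _ in range(20):
    resid = y - b * x
    e2 = resid ** 2
    # Step 2: regress squared residuals on a constant and their first lag.
    Z = np.column_stack([np.ones(n - 1), e2[:-1]])
    (a0_hat, a1_hat), *_ = np.linalg.lstsq(Z, e2[1:], rcond=None)
    # Step 3: weighted least squares with the fitted conditional variances h_t.
    h = np.maximum(a0_hat + a1_hat * e2[:-1], 1e-8)   # ad hoc positivity guard
    w = 1.0 / h
    b_new = np.sum(w * x[1:] * y[1:]) / np.sum(w * x[1:] ** 2)
    converged = abs(b_new - b) < 1e-10
    b = b_new
    if converged:
        break

print(f"beta_hat = {b:.3f}, alpha0_hat = {a0_hat:.3f}, alpha1_hat = {a1_hat:.3f}")
```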

There are, however, problems with this simple procedure that we have
ignored. We might get estimates of $\alpha_1$ less than zero or greater than one. These
problems as well as the computation of the asymptotic variances and
covariances are discussed in Engle’s paper. The purpose of our discussion is to point
out one more source for a significant DW test statistic when, in fact, there is
no serial correlation.


Summary 


1. Most economic data consist of time series and there is very often a
correlation in the errors corresponding to successive time periods. This is the
problem of autocorrelation.

2. The Durbin-Watson (DW) test is the test most often used to detect the
presence of autocorrelation. If this test detects the presence of autocorrelation, it
is customary to transform the data on the basis of the estimated first-order
autocorrelation and use least squares with the transformed data.

3. There are several limitations to this procedure. These limitations
(discussed in Section 6.9) are:

(a) The serial correlation might be of a higher order. 

(b) The serial correlation might be due to omitted variables. 

(c) The serial correlation might be due to the noninclusion of lagged values of 
the explained and explanatory variable, that is, due to misspecification of 
the dynamic process. 

4. Very often, simple solutions are suggested for handling the serial
correlation problem, such as estimation in first differences. The issue of estimation
of equations in levels versus first differences is discussed in Section 6.3 and
also in Section 6.10. Other solutions are discussed in Section 6.4.

5. There have been some extensions and further tables prepared for the DW 
test. These extensions are outlined in Section 6.6. 

6. The DW test is not applicable if there are lagged dependent variables in
the model. Durbin suggested an alternative test, known as Durbin’s h-test. This
test is explained and illustrated in Section 6.7. Some problems with its use are
also illustrated there. This test, again, is for first-order autocorrelation only.

7. A general test which is, however, asymptotic, is the LM test. This test
can be used for any specified order of the autocorrelation. It can be applied
whether there are lagged dependent variables or not. It can be used with
standard regression packages. It is based on omitted variables. This test is
discussed in Section 6.8. It consists of two steps:

(a) First estimate the equation by ordinary least squares and get the residual
$\hat u_t$.

(b) Now introduce appropriate lags of $\hat u_t$ in the original equation and reestimate
it by least squares. Test that the coefficients of the lagged $\hat u_t$’s are zero using
the standard tests.

The LM test has not been illustrated with an example. This is left as an
exercise. Many of the data sets presented in the book are time-series data, and
students can use these to apply the LM test.

8. The effects of autocorrelated errors on least squares estimators are:

(a) If there are no lagged dependent variables among the explanatory
variables, the estimators are unbiased but inefficient. However, the estimated
variances are biased, sometimes substantially. These problems are
discussed in Section 6.5 and the biases are presented for some simple cases.

(b) If there are lagged dependent variables among the explanatory variables,
the least squares estimators are not even consistent (see Section 6.7). In
this case the DW test statistic is biased as well. This is the reason for the
use of Durbin’s h-test.

9. Obtaining a significant DW test statistic does not necessarily mean that
we have a serial correlation problem. In fact, we may not have a serial
correlation problem and we may be applying the wrong solution. For this purpose
Sargan suggested that we first test for common factors and then apply tests for
serial correlation if there is a common factor. This argument is explained at the
end of Section 6.9 and illustrated with an example.

10. Economic time series can conveniently be classified as belonging to the
DSP class or TSP class. The appropriate procedure for trend elimination
(whether to use differences or regressions on time) will depend on this
classification. One can apply the Dickey-Fuller test (or Sargan-Bhargava test) to
test whether the time series is of the DSP type or TSP type. Most economic
time series, however, are of the DSP type and, thus, estimation in first
differences is appropriate. However, differencing eliminates all information on the
long-run properties of the model. One suggestion that has been made is to see
whether the time series are cointegrated. If this is so, then both long-run and
short-run parameters can be estimated (either separately or jointly).

11. Sometimes even though the errors in the equation are not autocorrelated, 
the variance of the error term can depend on the past history of errors. In such 
models, called ARCH models, one can find a significant DW test statistic even 
though there is no serial correlation in the errors. A test for the ARCH effect 
will enable us to judge whether the observed serial correlation is spurious. 
Exercises


1. Explain the following. 

(a) The Durbin-Watson test. 

(b) Estimation with quasi first differences. 

(c) The Cochrane-Orcutt procedure. 

(d) Durbin’s h-test.

(e) Serial correlation due to misspecified dynamics.

(f) Estimation in levels versus first differences. 

2. Use the Durbin-Watson test to test for serial correlation in the errors in 
Exercises 17 and 19 at the end of Chapter 4. 

3. I am estimating an equation in which $y_{t-1}$ is also an explanatory variable. I
get the following results.

$$y_t = 2.7 + \underset{(0.4)}{0.4}\,x_t + \underset{(0.06)}{0.9}\,y_{t-1} \qquad R^2 = 0.98 \qquad \text{DW} = 1.9$$

I find that the $R^2$ is very high and the Durbin-Watson statistic is close to 2,
showing that there is no serial correlation in the errors. My friend tells me
that even if the $R^2$ is high, this is a useless equation. Why is this a useless
equation?

4. Examine whether the following statements are true or false. Give an
explanation.

(a) Serial correlation in the errors $u$ leads to biased estimates and biased
standard errors when the regression equation $y = \beta x + u$ is estimated
by ordinary least squares.

(b) The Durbin-Watson test for serial correlation is not applicable if the 
errors are heteroskedastic. 

(c) The Durbin-Watson test for serial correlation is not applicable if there 
are lagged dependent variables as explanatory variables. 

(d) An investigator estimating a demand function in levels and first
differences obtained $R^2$’s of 0.90 and 0.80, respectively. He chose the
equation in levels because he got a higher $R^2$. This is a valid reason for
choosing between the two models.

(e) Least squares techniques when applied to economic time-series data 
usually yield biased estimates because many economic time series are 
autocorrelated. 

(f) The Durbin-Watson test can be used to describe whether the errors
in a regression equation based on time-series data are serially
independent.

(g) The fact that the Durbin-Watson statistic is significant does not
necessarily mean that there is serial correlation in the errors. One has to
apply some other tests to come to this conclusion.

(h) Consider the model $y_t = \alpha y_{t-1} + \beta x_t + u_t$, where the errors are
autoregressive. Even if the OLS method gives inconsistent estimates of
the parameters, we can still use the equation for purposes of
prediction if the evolution of $x_t$ during the prediction period is the same as
in the estimation period.

(i) Consider the model

$$y_t = \alpha + \beta x_t + u_t$$

$$u_t = \rho u_{t-1} + e_t \qquad 0 < \rho < 1$$

where $e_t$ are IN(0, $\sigma^2$). By regressing $\Delta y_t$ on $\Delta x_t$, it is possible to get more
efficient estimates of $\beta$ than by regressing $y_t$ on $x_t$.

(j) The Durbin-Watson test is a useless test because it is inapplicable in 
almost every situation that we encounter in practice. 

5. The phrase “since the model contains a lagged dependent variable, the DW 
statistic is unreliable” is frequently seen in empirical work. 

(a) What does this phrase mean? 

(b) Is there some way to get around this problem? 

6. Apply the LM test to test for first-order and second-order serial correlation
in errors for the estimation of some multiple regression models with the data
sets presented in Chapter 4. In each case compare the results with those
obtained by using the DW test and Durbin’s h-test if there are lagged dependent
variables in the explanatory variables.

7. Apply Sargan’s common factor test to check that the significant serial
correlation is not due to misspecified dynamics.

8. For the housing starts data in Table 4.10, illustrate testing for
fourth-order autocorrelation using the DW test and the LM test.

7.1 Introduction


Very often the data we use in multiple regression analysis cannot give decisive
answers to the questions we pose. This is because the standard errors are very
high or the t-ratios are very low. The confidence intervals for the parameters
of interest are thus very wide. This sort of situation occurs when the
explanatory variables display little variation and/or high intercorrelations. The
situation where the explanatory variables are highly intercorrelated is referred to as
multicollinearity. When the explanatory variables are highly intercorrelated, it
becomes difficult to disentangle the separate effects of each of the explanatory
variables on the explained variable. The practical questions we need to ask are
how high these intercorrelations have to be to cause problems in our inference
about the individual parameters and what we can do about this problem. We
argue in the subsequent sections that high intercorrelations among the
explanatory variables need not necessarily create a problem and that some solutions
often suggested for the multicollinearity problem can actually lead us down the
wrong track. The suggested cures are sometimes worse than the disease.

The term “multicollinearity” was first introduced in 1934 by Ragnar Frisch¹
in his book on confluence analysis and referred to a situation where the
variables dealt with are subject to two or more relations. In his analysis there was
no dichotomy of explained and explanatory variables. It was assumed that all
variables were subject to error and, given the sample variances and covariances,
the problem was to estimate the different linear relationships among the
true variables. The problem was thus one of errors in variables. We will,
however, be discussing the multicollinearity problem as it is commonly discussed
in multiple regression analysis, namely, the problem of high intercorrelations
among the explanatory variables.

Multicollinearity or high intercorrelations among the explanatory variables
need not necessarily be a problem. Whether or not it is a problem depends on
other factors, as we will see presently. Thus the multicollinearity problem
cannot be discussed entirely in terms of the intercorrelations among the variables.
Further, different parametrizations of the variables will give different
magnitudes of these intercorrelations. This point is explained in the next section with
some examples. Most discussions of the multicollinearity problem and
its solutions rely on criteria based on the intercorrelations between the
explanatory variables. However, this is an incorrect approach, as will be clear
from the examples given in the next section.


7.2 Some Illustrative Examples 


We first discuss some examples where the intercorrelations between the
explanatory variables are high and study the consequences.

Consider the model $y = \beta_1 x_1 + \beta_2 x_2 + u$. If $x_2 = 2x_1$, we have

$$y = \beta_1 x_1 + \beta_2 (2x_1) + u = (\beta_1 + 2\beta_2)x_1 + u$$

Thus only $(\beta_1 + 2\beta_2)$ would be estimable. We cannot get estimates of $\beta_1$ and
$\beta_2$ separately. In this case we say that there is “perfect multicollinearity,”
because $x_1$ and $x_2$ are perfectly correlated (with $r_{12}^2 = 1$). In actual practice we
encounter cases where $r_{12}^2$ is not exactly 1 but close to 1.


1 Ragnar Frisch, Statistical Confluence Analysis by Means of Complete Regression Systems,
Publication 5 (Oslo: University Institute of Economics, 1934).

As an illustration, consider the case where

$$S_{11} = 200 \qquad S_{1y} = 350$$
$$S_{12} = 150 \qquad S_{2y} = 263$$
$$S_{22} = 113$$

so that the normal equations are

$$200\hat\beta_1 + 150\hat\beta_2 = 350$$
$$150\hat\beta_1 + 113\hat\beta_2 = 263$$

The solution is $\hat\beta_1 = 1$ and $\hat\beta_2 = 1$. Suppose that we drop an observation and
the new values are

$$S_{11} = 199 \qquad S_{1y} = 347.5$$
$$S_{12} = 149 \qquad S_{2y} = 261.5$$
$$S_{22} = 112$$

Now when we solve the equations

$$199\hat\beta_1 + 149\hat\beta_2 = 347.5$$
$$149\hat\beta_1 + 112\hat\beta_2 = 261.5$$

we get $\hat\beta_1 = -\tfrac{1}{2}$, $\hat\beta_2 = 3$.

Thus very small changes in the variances and covariances produce drastic
changes in the estimates of the regression parameters. It is easy to see that the
squared correlation coefficient between the two explanatory variables is given by

$$r_{12}^2 = \frac{(150)^2}{200(113)} = 0.995$$

which is very high.

In practice, addition or deletion of observations would produce changes in
the variances and covariances. Thus one of the consequences of high correlation
between $x_1$ and $x_2$ is that the parameter estimates would be very sensitive
to the addition or deletion of observations. This aspect of multicollinearity can
be checked in practice by deleting or adding some observations and examining
the sensitivity of the estimates to such perturbations.
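
The sensitivity described above can be checked numerically. The following is a minimal sketch, not part of the original text, that solves the two-regressor normal equations by Cramer's rule for the two sets of moments given in the illustration. The function name `solve_normal_equations` is an illustrative choice.

```python
def solve_normal_equations(s11, s12, s22, s1y, s2y):
    """Solve  s11*b1 + s12*b2 = s1y,  s12*b1 + s22*b2 = s2y  by Cramer's rule."""
    det = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


# Moments before and after dropping one observation (values from the text).
full = solve_normal_equations(200, 150, 113, 350, 263)
dropped = solve_normal_equations(199, 149, 112, 347.5, 261.5)

print("full sample:     ", full)
print("one obs. dropped:", dropped)
```

The determinant falls from 100 to 87 while the right-hand sides barely move, which is why the solution jumps from (1, 1) to (-0.5, 3).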

One other symptom of the multicollinearity problem that is often mentioned 
is that the standard errors of the estimated regression coefficients will be very 
high. However, high values of $r_{12}^2$ need not necessarily imply high standard
errors, and conversely, even low values of $r_{12}^2$ can produce high standard errors.
In Section 4.3 we derived the standard errors for the case of two explanatory
variables. There we derived the following formulas:

$$V(\hat\beta_1) = \frac{\sigma^2}{S_{11}(1 - r_{12}^2)} \tag{7.1}$$

$$V(\hat\beta_2) = \frac{\sigma^2}{S_{22}(1 - r_{12}^2)} \tag{7.2}$$

and

$$\operatorname{cov}(\hat\beta_1, \hat\beta_2) = \frac{-\sigma^2 r_{12}^2}{S_{12}(1 - r_{12}^2)} \tag{7.3}$$

where $\sigma^2$ is the variance of the error term. Thus the variance of $\hat\beta_1$ will be high
if

1. $\sigma^2$ is high.

2. $S_{11}$ is low.

3. $r_{12}^2$ is high.

Even if $r_{12}^2$ is high, if $\sigma^2$ is low and $S_{11}$ high, we will not have the problem of
high standard errors. On the other hand, even if $r_{12}^2$ is low, the standard errors
can be high if $\sigma^2$ is high and $S_{11}$ is low (i.e., there is not sufficient variation in
$x_1$). What this suggests is that high values of $r_{12}^2$ do not tell us anything about whether
we have a multicollinearity problem or not.
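
A short numerical sketch of formula (7.1) may make the point concrete. The input values below are hypothetical and chosen only for illustration; they do not come from the text.

```python
def var_beta1(sigma2, s11, r12_sq):
    """Variance of the OLS estimate of beta_1 from formula (7.1)."""
    return sigma2 / (s11 * (1.0 - r12_sq))


# Hypothetical case A: very high intercorrelation, but small error
# variance and plenty of variation in x1.
high_corr = var_beta1(sigma2=1.0, s11=1000.0, r12_sq=0.95)

# Hypothetical case B: low intercorrelation, but large error variance
# and little variation in x1.
low_corr = var_beta1(sigma2=10.0, s11=5.0, r12_sq=0.10)

print("r12^2 = 0.95:", round(high_corr, 4))
print("r12^2 = 0.10:", round(low_corr, 4))
```

Under these assumed inputs the variance is roughly a hundred times larger in the low-correlation case, so $r_{12}^2$ alone does not rank the two situations correctly.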

When we have more than two explanatory variables, the simple correlations
among them become all the more meaningless. As an illustration, consider the
following example with 20 observations on $x_1$, $x_2$, and $x_3$.

$x_1$ = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0, and 10 zeros)

$x_2$ = (0, 0, 0, 0, 0, 1, 1, 1, 1, 1, and 10 zeros)

$x_3$ = (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, and 10 zeros)

Obviously, $x_3 = x_1 + x_2$ and we have perfect multicollinearity. But we can
easily see that $r_{12} = -\tfrac{1}{3}$ and $r_{13} = r_{23} = 1/\sqrt{3} \approx 0.58$, and thus the simple
correlations are not high. In the case of more than two explanatory variables,
what we have to consider are multiple correlations of each of the explanatory
variables with the other explanatory variables. Note that the standard error
formulas corresponding to (7.1) and (7.2) are


$$V(\hat\beta_i) = \frac{\sigma^2}{S_{ii}(1 - R_i^2)} \tag{7.4}$$

where $\sigma^2$ and $S_{ii}$ are defined as before in the case of two explanatory variables
and $R_i^2$ represents the squared multiple correlation coefficient between $x_i$ and
the other explanatory variables. Again, it is easy to see that $V(\hat\beta_i)$ will be high
if

1. $\sigma^2$ is high.

2. $S_{ii}$ is low.

3. $R_i^2$ is high.

Thus high $R_i^2$ is neither necessary nor sufficient to get high standard errors and
thus multicollinearity by itself need not cause high standard errors.
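
Returning to the 20-observation example with $x_3 = x_1 + x_2$, the simple correlations can be computed directly. This is an illustrative sketch using only the Python standard library; the helper `corr` is defined here and is not from the text.

```python
from math import sqrt


def corr(a, b):
    """Simple (Pearson) correlation coefficient of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / sqrt(s_aa * s_bb)


x1 = [1] * 5 + [0] * 15
x2 = [0] * 5 + [1] * 5 + [0] * 10
x3 = [a + b for a, b in zip(x1, x2)]   # exact linear dependence

r12, r13, r23 = corr(x1, x2), corr(x1, x3), corr(x2, x3)
print(round(r12, 3), round(r13, 3), round(r23, 3))   # -0.333 0.577 0.577
```

None of the pairwise correlations is large in absolute value, even though $x_3$ is an exact linear function of $x_1$ and $x_2$, so the multiple correlation of any one variable on the other two equals 1.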

There are several rules of thumb that have been suggested in the literature
to detect when multicollinearity can be considered a serious problem. For
instance, Klein says²: “Intercorrelation of variables is not necessarily a problem
unless it is high relative to the overall degree of multiple correlation.” By
Klein’s rule multicollinearity would be regarded as a problem only if $R_y^2 < R_i^2$,
where $R_y^2$ is the squared multiple correlation coefficient between $y$ and the
explanatory variables, and $R_i^2$ is as defined earlier. However, note that even if
$R_y^2 < R_i^2$ we can still have significant partial correlation coefficients (i.e.,
significant t-ratios for the regression coefficients). For example, suppose that the
correlations between $y$, $x_1$, and $x_2$ are given by

          y      x_1     x_2
y        1.00    0.95    0.95
x_1      0.95    1.00    0.97
x_2      0.95    0.97    1.00

Then it can be verified that $R_{y\cdot12}^2 = 0.916$ and $r_{12}^2 = 0.941$. Thus $R_y^2 < r_{12}^2$.
But $r_{y1\cdot2}^2 = r_{y2\cdot1}^2 \approx 0.14$. Since the relationship between the t-ratio and
partial $r^2$ is given by (see Section 4.5)

$$r^2 = \frac{t^2}{t^2 + \text{degrees of freedom}}$$

we will get t-values greater than 3 if the number of observations is greater than
60.
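
The figures in this example can be reproduced from the three simple correlations. The sketch below uses the standard two-regressor formulas for the squared multiple and partial correlations. The degrees of freedom are taken as $n - 3$, which assumes an intercept and two slope coefficients; that convention is an assumption made here, not something stated at this point in the text.

```python
from math import sqrt

ry1, ry2, r12 = 0.95, 0.95, 0.97

# Squared multiple correlation of y on x1 and x2.
R2 = (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)

# Squared partial correlation between y and x1, holding x2 fixed.
# By symmetry of the inputs, the partial for x2 is the same.
partial_sq = (ry1 - ry2 * r12) ** 2 / ((1 - ry2**2) * (1 - r12**2))


def t_ratio(partial_r2, df):
    """Invert r^2 = t^2 / (t^2 + df) to obtain |t|."""
    return sqrt(partial_r2 * df / (1 - partial_r2))


n = 60
t_60 = t_ratio(partial_sq, n - 3)   # df = n - 3 is an assumed convention

print(round(R2, 3), round(r12**2, 3), round(partial_sq, 3), round(t_60, 2))
```

With 60 observations the implied t-ratio is a little above 3, in line with the statement in the text.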

We can summarize the previous discussion as follows: 

1. If we have more than two explanatory variables, we should use the $R_i^2$ values
to measure the degree of intercorrelations among the explanatory
variables, not the simple correlations among the variables.³

2. However, whether multicollinearity is a problem or not for making
inferences on the parameters will depend on other factors besides the $R_i^2$’s, as is
clear from equation (7.4). What is relevant is the standard errors and
t-ratios. Of course, if $R_i^2$ is low, we would be better off. But this argument
is only a poor consolation. It is not appropriate to draw any conclusions
about whether multicollinearity is a problem or not just on the basis of
the $R_i^2$’s. The $R_i^2$’s are useful only as a complaint. Moreover, the $R_i^2$’s depend
on the particular parametrization adopted, as we discuss in the next
section.


2 L. R. Klein, An Introduction to Econometrics (Englewood Cliffs, N.J.: Prentice Hall, 1962), 
p. 101. 

3 Of course, some other measures based on the eigenvalues of the correlation matrix have been 
suggested in the literature, but a discussion of this is beyond our scope. Further, all these 
measures are only a “complaint” that the explanatory variables are highly intercorrelated; they 
do not tell whether the problem is serious.

7.3 Some Measures of Multicollinearity


It is important to be familiar with two measures that are often suggested in the
discussion of multicollinearity: the variance-inflation factor (VIF) and the
condition number.

The VIF is defined as

$$\mathrm{VIF}(\hat\beta_i) = \frac{1}{1 - R_i^2}$$

where $R_i^2$ is the squared multiple correlation coefficient between $x_i$ and the other
explanatory variables. Looking at the formula (7.4), we can interpret $\mathrm{VIF}(\hat\beta_i)$
as the ratio of the actual variance of $\hat\beta_i$ to what the variance of $\hat\beta_i$ would have
been if $x_i$ were to be uncorrelated with the remaining $x$’s. Implicitly, an ideal
situation is considered to be one where the $x$’s are all uncorrelated with each
other, and the VIF compares the actual situation with an ideal situation. This
comparison is not very useful and does not provide us guidance as to what to
do with the problem. It is more a complaint that things are not ideal. Also,
looking at formula (7.4), as we have discussed earlier, $1/(1 - R_i^2)$ is not the only
factor determining whether multicollinearity presents a problem in making
inferences.

Whereas the VIF is something we compute for each explanatory variable
separately, the condition number discussed by Raduchel⁴ and Belsley, Kuh,
and Welsch⁵ is an overall measure. The condition number is supposed to
measure the sensitivity of the regression estimates to small changes in the data. It
is defined as the square root of the ratio of the largest to the smallest eigenvalue
of the matrix $X'X$ of the explanatory variables. Eigenvalues are explained in
the appendix to this chapter. For the two-variable case in Section 7.2 it is easily
computed. We solve the equation

$$(S_{11} - \lambda)(S_{22} - \lambda) - S_{12}^2 = 0$$

$$(200 - \lambda)(113 - \lambda) - (150)^2 = 0$$

$$\lambda^2 - 313\lambda + 100 = 0$$

which gives $\lambda_1 = 312.68$, $\lambda_2 = 0.32$ as the required eigenvalues. The condition
number $= \sqrt{\lambda_1/\lambda_2} = 31.26$. The closer the condition number is to 1, the better
the condition is.
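
As a check on the arithmetic, the eigenvalues, the condition number, and the VIF for the Section 7.2 moments can be computed with the quadratic formula. This is a minimal sketch using only the standard library; variable names are illustrative.

```python
from math import sqrt

s11, s12, s22 = 200.0, 150.0, 113.0

# Eigenvalues of [[s11, s12], [s12, s22]] are the roots of
# lambda^2 - (s11 + s22) * lambda + (s11 * s22 - s12^2) = 0.
trace = s11 + s22
det = s11 * s22 - s12**2
lam_max = (trace + sqrt(trace**2 - 4 * det)) / 2
lam_min = det / lam_max   # product of the roots equals det

condition_number = sqrt(lam_max / lam_min)

# With two regressors, R_i^2 is just the squared simple correlation.
vif = 1 / (1 - s12**2 / (s11 * s22))

print(round(lam_max, 2), round(lam_min, 2))
print(round(condition_number, 2), round(vif, 1))
```

Computed without intermediate rounding, the condition number comes out at about 31.27 rather than 31.26; the small gap arises because the text's figure uses the eigenvalues rounded to two decimals. The VIF is 226, which conveys the same message as $r_{12}^2 = 0.995$.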

Again, there are three problems with this: 

1. It looks at only the correlations among the explanatory variables, and
formula (7.4) shows that this is not the only relevant factor.


4 W. J. Raduchel, “Multicollinearity Once Again,” Paper 205. Harvard Institute of Economic 
Research, Cambridge, Mass., 1971. 

5 D. Belsley, E. Kuh, and R. Welsch, Regression Diagnostics (New York: Wiley, 1980),

2. The condition number can change by a reparametrization of the
variables. For instance, if we define $z_1 = x_1 + x_2$ and $z_2 = x_1 - x_2$, the
condition number will change. In fact, it can be made equal to 1 with
suitable transformations of the variables.

3. Even if such transformations of variables are not always meaningful
(what does 2 apples + 3 oranges mean?), the condition number is merely
a complaint that things are not ideal.⁶

In Section 7.4 we consider an example where transformations of variables
are meaningful. However, even when they are not, the VIF and condition
numbers are only measures of how bad things are relative to some ideal situation,
but the standard errors and t-ratios will tell a better story of how bad things
are. The condition number (CN) is actually a “complaint number.”

The VIFs and condition number will be useful for dropping some variables
and imposing parameter constraints only in some very extreme cases where
$R_i^2 \approx 1.0$ or the smallest eigenvalue is very close to zero. In this case we
estimate the model subject to some constraints on the parameters. This point is
illustrated in Section 7.5 with an example.

The major aspect of the VIF and the condition number is that they look at
only the intercorrelations among the explanatory variables. A measure that
considers the correlations of the explanatory variables with the explained
variable is Theil’s measure,⁷ which is defined as

$$m = R^2 - \sum_{i=1}^{k} (R^2 - R_{-i}^2)$$

where $R^2$ = squared multiple correlation from a regression of $y$
on $x_1, x_2, \ldots, x_k$

$R_{-i}^2$ = squared multiple correlation from a regression of $y$ on $x_1, x_2, \ldots, x_k$
with $x_i$ omitted

The quantity $(R^2 - R_{-i}^2)$ is termed the “incremental contribution” to the
squared multiple correlation by Theil. If $x_1, x_2, \ldots, x_k$ are mutually
uncorrelated, then $m$ will be 0 because the incremental contributions all add up to $R^2$.
In other cases $m$ can be negative as well as highly positive. This makes it
difficult to use it for any guidance.

To see what this measure means and how it is related to the t-ratios, let us
consider the case of two explanatory variables. Following the notation in
Section 4.6, we will write $R^2$ as $R_{y\cdot12}^2$. $R_{-1}^2$ and $R_{-2}^2$ are now just the squared simple
correlations $r_{y2}^2$ and $r_{y1}^2$. Thus

$$m = R_{y\cdot12}^2 - (R_{y\cdot12}^2 - r_{y2}^2) - (R_{y\cdot12}^2 - r_{y1}^2)$$


6 E. E. Leamer, “Model Choice and Specification Analysis,” in Z. Griliches and M. D.
Intriligator (eds.), Handbook of Econometrics, Vol. 1 (Amsterdam: North-Holland, 1983), pp. 286-
330.

7 H. Theil, Principles of Econometrics (New York: Wiley, 1971), p. 179.

The t-ratios are related to the partial $r^2$’s, $r_{y1\cdot2}^2$ and $r_{y2\cdot1}^2$. We also derived in
Section 4.6 the relation that

$$(1 - R_{y\cdot12}^2) = (1 - r_{y1}^2)(1 - r_{y2\cdot1}^2) = 1 - r_{y1}^2 - (1 - r_{y1}^2)\,r_{y2\cdot1}^2$$

Hence

$$(R_{y\cdot12}^2 - r_{y1}^2) = (1 - r_{y1}^2)\,r_{y2\cdot1}^2$$

Thus

m = (squared multiple correlation coefficient)
    - (weighted sum of the partial $r^2$’s)

This weighted sum is $w_1 \cdot r_{y1\cdot2}^2 + w_2 \cdot r_{y2\cdot1}^2$, where $w_1 = 1 - r_{y2}^2$ and
$w_2 = 1 - r_{y1}^2$. If the partial $r^2$’s are all very low, $m$ will be very close to multiple $R^2$.
In the earlier example that we gave to illustrate Klein’s measure of
multicollinearity, we had $R_{y\cdot12}^2 = 0.916$, $r_{y1}^2 = r_{y2}^2 = 0.9025$, and $r_{y1\cdot2}^2 = r_{y2\cdot1}^2 = 0.14$. Thus
Theil’s measure of multicollinearity is

$$m = 0.916 - 2(1 - 0.9025)(0.14) = 0.888$$

Of course, $m$ is not zero. But is multicollinearity serious or not? One can never
tell. If the number of observations is greater than 60, we will get significant
t-ratios. Thus Theil’s measure is even less useful than VIF and condition
number.
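
Theil's measure for two regressors can be computed directly from the definition. The sketch below is illustrative only; the function name is hypothetical, and the second call uses made-up correlations simply to confirm that $m = 0$ when the regressors are uncorrelated.

```python
def theil_measure_two_regressors(ry1, ry2, r12):
    """Theil's m = R^2 - sum_i (R^2 - R^2_{-i}) for two explanatory variables."""
    R2 = (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)
    # Omitting x1 leaves a simple regression on x2, and vice versa.
    incremental = [R2 - ry2**2, R2 - ry1**2]
    return R2 - sum(incremental)


# Correlations from the example used to illustrate Klein's rule.
m = theil_measure_two_regressors(0.95, 0.95, 0.97)

# Hypothetical correlations with uncorrelated regressors (r12 = 0).
m_uncorrelated = theil_measure_two_regressors(0.6, 0.5, 0.0)

print(round(m, 4), round(m_uncorrelated, 4))
```

The first value agrees with the figure in the text up to rounding in the third decimal place.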

We have discussed several measures of multicollinearity and they are all of
limited use from the practical point of view. As Leamer puts it, they are all
merely complaints that things are not ideal. The standard errors and t-ratios
give us more information about how serious things are. It is relevant to
remember formula (7.4) while assessing any measures of multicollinearity.

Leamer suggests some measures of multicollinearity based on the sensitivity
of inferences to different forms of prior information. Since a discussion of these
measures involves a knowledge of Bayesian statistical inference and
multivariate statistics, we have to omit these measures.


7.4 Problems with Measuring 
Multicollinearity 

In Section 7.3 we talked of measuring multicollinearity in terms of the
intercorrelations among the explanatory variables. However, there is one problem
with this. The intercorrelations can change with a redefinition of the
explanatory variables. Some examples will illustrate this point.

Let us define

$C$ = real consumption per capita

$Y$ = real per capita current income

$Y_P$ = real per capita permanent income

$Y_T$ = real per capita transitory income

$Y = Y_T + Y_P$, and $Y_P$ and $Y_T$ are uncorrelated

Suppose that we formulate the consumption function as

$$C = \alpha Y + \beta Y_P + u \tag{7.5}$$

This equation can alternatively be written as

$$C = \alpha Y_T + (\alpha + \beta)Y_P + u \tag{7.6}$$

or

$$C = (\alpha + \beta)Y - \beta Y_T + u \tag{7.7}$$

All these equations are equivalent. However, the correlations between the
explanatory variables will be different depending on which of the three equations
is considered. In equation (7.5), since $Y$ and $Y_P$ are often highly correlated, we
would say that there is high multicollinearity. In equation (7.6), since $Y_T$ and
$Y_P$ are uncorrelated, we would say that there is no multicollinearity. However,
the two equations are essentially the same. What we should be talking about is
the precision with which $\alpha$ and $\beta$ or $(\alpha + \beta)$ are estimable.

Consider, for instance, the following data.⁸

var(C)   = 7.3     cov(C, Y)   = 8.3     cov(Y, Y_P)   = 9.0
var(Y)   = 10.0    cov(C, Y_P) = 8.0     cov(Y, Y_T)   = 1.0
var(Y_P) = 9.0     cov(C, Y_T) = 0.3     cov(Y_P, Y_T) = 0 by definition
var(Y_T) = 1.0

For these data the estimation of equation (7.5) gives (all variables measured as
deviations from their means)

$$\hat{C} = \underset{(0.32)}{0.30}\,Y + \underset{(0.33)}{0.59}\,Y_P \qquad \hat\sigma_u^2 = 0.1$$

(The figures in parentheses are standard errors.) One reason for the imprecision
in the estimates is that $Y$ and $Y_P$ are highly correlated (the correlation
coefficient is 0.95).


8 The data are those used in an illustrative example by Gary Smith, “An Example of Ridge
Regression Difficulties,” Canadian Journal of Statistics, Vol. 8, No. 2, 1980, pp. 217-225.

For equation (7.6) the correlation between the explanatory variables is zero
and for equation (7.7) it is 0.32. The least squares estimates of $\alpha$ and $\beta$ are no
more precise in equation (7.6) or (7.7).

Let us consider the estimation of equation (7.6). We get

$$\hat{C} = \underset{(0.32)}{0.30}\,Y_T + \underset{(0.11)}{0.89}\,Y_P$$

The estimate of $(\alpha + \beta)$ is thus 0.89 and the standard error is 0.11. Thus
$\alpha + \beta$ is indeed more precisely estimated than either $\alpha$ or $\beta$. As for $\alpha$, it is not
precisely estimated even though the explanatory variables in this equation are
uncorrelated. The reason is that the variance of $Y_T$ is very low [see formula
(7.1)].
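
The estimates for equations (7.5) and (7.6) can be reproduced from the second moments listed above. The sketch below is illustrative: it treats the listed variances and covariances exactly like the sums $S_{ij}$ in formulas (7.1) and (7.2), with no degrees-of-freedom adjustment. That treatment is an assumption made here because it appears to match the reported standard errors.

```python
from math import sqrt


def ols_two_regressors(s11, s12, s22, s1y, s2y, syy):
    """OLS from second moments: coefficients, residual variance, standard errors."""
    det = s11 * s22 - s12**2
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    sigma2 = syy - b1 * s1y - b2 * s2y      # residual variance
    se1 = sqrt(sigma2 * s22 / det)          # equals sqrt(sigma2 / (s11 * (1 - r12^2)))
    se2 = sqrt(sigma2 * s11 / det)
    return (b1, b2), sigma2, (se1, se2)


# Equation (7.5): regressors Y and Y_P (correlation about 0.95).
eq75 = ols_two_regressors(10.0, 9.0, 9.0, 8.3, 8.0, 7.3)

# Equation (7.6): regressors Y_T and Y_P (uncorrelated).
eq76 = ols_two_regressors(1.0, 0.0, 9.0, 0.3, 8.0, 7.3)

for name, (coef, sigma2, se) in (("(7.5)", eq75), ("(7.6)", eq76)):
    print(name, [round(c, 2) for c in coef], round(sigma2, 3), [round(s, 3) for s in se])
```

The coefficient on the first regressor and its standard error are identical in the two parametrizations, while the standard error of the second coefficient drops sharply in (7.6). Small differences from the text in the last reported digit of the standard errors seem to come from rounding $\hat\sigma_u^2$ to 0.1.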

We can summarize the conclusions from this illustrative example as follows:

1. It is difficult to define multicollinearity in terms of the correlations
between the explanatory variables because the explanatory variables can
be redefined in a number of different ways and these can give drastically
different measures of intercorrelations. In some cases, these redefinitions
may not make sense, but in the example above involving measured
income, permanent income, and transitory income, these redefinitions
make sense.

2. Just because the explanatory variables are uncorrelated, it does not mean
that we have no problems with inference. Note that the estimate of $\alpha$ and
its standard error are the same in equation (7.5) (with the correlation
among the explanatory variables equal to 0.95) and in equation (7.6) (with
the explanatory variables uncorrelated).

3. Often, though the individual parameters are not precisely estimable,
some linear combinations of the parameters are. For instance, in our
example, $\alpha + \beta$ is estimable with good precision. Sometimes, these linear
combinations do not make economic sense. But at other times they do.

We will present yet another example to illustrate some problems in judging 
whether multicollinearity is serious or not and also to illustrate the fact that 
even if individual parameters are not estimable with precision, some linear 
functions of the parameters are. 

In Table 7.1 we present data on C, Y, and L for the period from the first 
quarter of 1952 to the second quarter of 1961. C is consumption expenditures, 
Y is disposable income, and L is liquid assets at the end of the previous quarter. 
All figures are in billions of 1954 dollars. 9 Using the 38 observations we get the 
following regression equations: 

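$$\hat{C} = \underset{(-1.93)}{-7.160} + \underset{(73.25)}{0.95213}\,Y \qquad r^2 = 0.9933 \tag{7.8}$$

$$\hat{C} = \underset{(-3.25)}{-10.627} + \underset{(9.60)}{0.68166}\,Y + \underset{(3.96)}{0.37252}\,L \qquad R^2 = 0.9953 \tag{7.9}$$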


9 The data are from Z. Griliches et al., “Notes on Estimated Aggregate Quarterly Consumption
Functions,” Econometrica, July 1962.

Table 7.1 Data on Consumption, Income, and Liquid Assets

Year, Quarter      C       Y       L      Year, Quarter      C       Y       L
1952   I         220.0   238.1   182.7    1957   I         268.9   291.1   218.2
       II        222.7   240.9   183.0           II        270.4   294.6   218.5
       III       223.8   245.8   184.4           III       273.4   296.1   219.8
       IV        230.2   248.8   187.0           IV        272.1   29jL3   219.5
1953   I         234.0   253.3   189.4    1958   I         268.9   291.3   220.5
       II        236.2   256.1   192.2           II        270.9   292.6   222.7
       III       236.0   255.9   193.8           III       274.4   299.9   225.0
       IV        234.1   255.9   194.8           IV        278.7   302.1   229.4
1954   I         233.4   254.4   197.3    1959   I         283.8   305.9   232.2
       II        236.4   254.8   197.0           II        289.7   312.5   235.2
       III       239.0   257.0   200.3           III       290.8   311.3   237.2
       IV        243.2   260.9   204.2           IV        292.8   313.2   237.7
1955   I         248.7   263.0   207.6    1960   I         295.4   315.4   238.0
       II        253.7   271.5   209.4           II        299.5   320.3   238.4
       III       259.9   276.5   211.1           III       298.6   321.0   240.1
       IV        261.8   281.4   213.2           IV        299.6   320.1   243.3
1956   I         263.2   282.0   214.1    1961   I         297.0   318.4   246.1
       II        263.7   286.2   216.5           II        301.6   324.8   250.0
       III       263.4   287.7   217.3
       IV        266.9   291.0   217.3

$$\hat{L} = \underset{(1.80)}{9.307} + \underset{(37.20)}{0.72607}\,Y \qquad r_{LY}^2 = 0.9758 \tag{7.10}$$

(Figures in parentheses are t-ratios, not standard errors.)

Equation (7.10) shows that $L$ and $Y$ are very highly correlated. In fact,
substituting the value of $L$ in terms of $Y$ from (7.10) into equation (7.9) and
simplifying, we get equation (7.8) correct to four decimal places! However, looking
at the t-ratios in equation (7.9) we might conclude that multicollinearity is not
a problem.

Are we justified in this conclusion? Let us consider the stability of the
coefficients with deletion of some observations. Using only the first 36 observations
we get the following results:

$$\hat{C} = \underset{(-1.74)}{-6.980} + \underset{(67.04)}{0.95145}\,Y \qquad r^2 = 0.9925 \tag{7.11}$$

$$\hat{C} = \underset{(-3.71)}{-13.391} + \underset{(8.32)}{0.63258}\,Y + \underset{(4.24)}{0.45065}\,L \qquad R^2 = 0.9951 \tag{7.12}$$

$$\hat{L} = \underset{(2.69)}{14.255} + \underset{(37.80)}{0.70758}\,Y \qquad r_{LY}^2 = 0.9768 \tag{7.13}$$


Comparing (7.11) with (7.8) and (7.12) with (7.9), we see that the coefficients
in the latter pair of equations show far greater changes than those in the former pair. Of
course, if one applies the tests for stability discussed in Section 4.11, one might
conclude that the differences are not statistically significant at the 5% level. Note
that the test for stability that we use is the “predictive” test for stability.

Finally, we might consider predicting $C$ for the first two quarters of 1961
using equations (7.11) and (7.12). The predictions are:

               Equation (7.11)    Equation (7.12)
1961   I           295.96             298.93
       II          302.05             304.73

Thus the predictions from the equation including $L$ are further off from the
true values (297.0 and 301.6 in Table 7.1) than the predictions from the equation
excluding $L$. Thus if prediction were the sole criterion, one might as well drop
the variable $L$.
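
The predictions can be reproduced from the coefficients of equations (7.11) and (7.12) and the 1961 rows of Table 7.1. The sketch below is illustrative; the function and variable names are choices made here.

```python
# 1961 values of Y, L, and actual C taken from Table 7.1.
data = {
    "1961 I": (318.4, 246.1, 297.0),
    "1961 II": (324.8, 250.0, 301.6),
}


def predict_711(y):
    """Prediction from equation (7.11), which excludes L."""
    return -6.980 + 0.95145 * y


def predict_712(y, liquid):
    """Prediction from equation (7.12), which includes L."""
    return -13.391 + 0.63258 * y + 0.45065 * liquid


errors = {}
for quarter, (y, liquid, actual) in data.items():
    p1, p2 = predict_711(y), predict_712(y, liquid)
    errors[quarter] = (p1 - actual, p2 - actual)
    print(quarter, round(p1, 2), round(p2, 2),
          "errors:", round(p1 - actual, 2), round(p2 - actual, 2))
```

In both quarters the absolute prediction error from (7.12) is larger than that from (7.11).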

The example above illustrates four different ways of looking at the
multicollinearity problem:

1. Correlation between the explanatory variables $L$ and $Y$, which is high.
This suggests that the multicollinearity may be serious. However, we
explained earlier the fallacy in looking at just the correlation coefficients
between the explanatory variables.

2. Standard errors or t-ratios for the estimated coefficients: In this example
the t-ratios are significant, suggesting that multicollinearity might not be
serious.

3. Stability of the estimated coefficients when some observations are
deleted. Again one might conclude that multicollinearity is not serious, if
one uses a 5% level of significance for this test.

4. Examining the predictions from the model: If multicollinearity is a
serious problem, the predictions from the model would be worse than those
from a model that includes only a subset of the set of explanatory
variables.

The last criterion should be applied if prediction is the object of the analysis.
Otherwise, it would be advisable to consider the second and third criteria. The
first criterion is not useful, as we have so frequently emphasized.

7.5 Solutions to the Multicollinearity 
Problem: Ridge Regression 

One of the solutions often suggested for the multicollinearity problem is to use
what is known as ridge regression, first introduced by Hoerl and Kennard.¹⁰
Simply stated, the idea is to add a constant $\lambda$ to the variances of the
explanatory variables before solving the normal equations. For instance, in the
example in Section 7.2, we add 5 to $S_{11}$ and $S_{22}$. It is easy to see that the squared
correlation now drops to

$$r_{12}^2 = \frac{(150)^2}{205(118)} = 0.930$$

Thus the intercorrelations are decreased. One can easily see what a mechanical
solution this is. However, there is an enormous literature on ridge regression.

10 A. E. Hoerl and R. W. Kennard, “Ridge Regression: Biased Estimation for Non-orthogonal
Problems,” and “Ridge Regression: Applications to Non-orthogonal Problems,”
Technometrics, Vol. 12, 1970, pp. 55-67 and 69-82.

The addition of $\lambda$ to the variances produces biased estimators, but the
argument is that if the variance can be decreased, the mean-squared error will
decline. Hoerl and Kennard show that there always exists a constant $\lambda > 0$ such
that

$$\sum_{i=1}^{k} \mathrm{MSE}(\tilde\beta_i) < \sum_{i=1}^{k} \mathrm{MSE}(\hat\beta_i)$$

where $\tilde\beta_i$ are the estimators of $\beta_i$ from the ridge regression, $\hat\beta_i$ are the least
squares estimators, and $k$ is the number of regressors. Unfortunately, $\lambda$ is a
function of the regression parameters $\beta_i$ and error variance $\sigma^2$, which are
unknown. Hoerl and Kennard suggest trying different values of $\lambda$ and picking the
value of $\lambda$ so that “the system will stabilize” or the “coefficients do not have
unreasonable values.” Thus subjective arguments are used. Some others have
suggested obtaining initial estimates of $\beta_i$ and $\sigma^2$ and then using the estimated
$\lambda$. This procedure can be iterated and we get the iterated ridge estimator. The
usefulness of these procedures has also been questioned.¹¹
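
To see what "trying different values of $\lambda$" looks like in the simplest case, the sketch below applies the ridge adjustment to the two-regressor moments from Section 7.2 ($S_{11} = 200$, $S_{12} = 150$, $S_{22} = 113$, $S_{1y} = 350$, $S_{2y} = 263$). The grid of $\lambda$ values is arbitrary and chosen only for illustration; it is not a recommendation from the text.

```python
def ridge_two_regressors(s11, s12, s22, s1y, s2y, lam):
    """Ridge estimates: add lam to S11 and S22, then solve the normal equations."""
    a11, a22 = s11 + lam, s22 + lam
    det = a11 * a22 - s12**2
    b1 = (s1y * a22 - s12 * s2y) / det
    b2 = (a11 * s2y - s12 * s1y) / det
    return b1, b2


trace = {}
for lam in (0, 1, 5, 20):   # arbitrary illustrative grid
    trace[lam] = ridge_two_regressors(200, 150, 113, 350, 263, lam)
    r12_sq = 150**2 / ((200 + lam) * (113 + lam))
    print(lam, round(trace[lam][0], 3), round(trace[lam][1], 3), round(r12_sq, 3))
```

With $\lambda = 0$ the ordinary least squares solution (1, 1) is recovered, and with $\lambda = 5$ the squared correlation is the 0.930 computed earlier. Nothing in the output itself says which $\lambda$ to prefer, which is the subjectivity the text points out.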

One other problem about ridge regression is the fact that it is not invariant
to units of measurement of the explanatory variables and to linear
transformations of variables. If we have two explanatory variables $x_1$ and $x_2$ and we
measure $x_1$ in tens and $x_2$ in thousands, it does not make sense to add the same
value of $\lambda$ to the variances of both. This problem can be avoided by normalizing
each variable by dividing it by its standard deviation. Even if $x_1$ and $x_2$ are
measured in the same units, in some cases there are different linear
transformations of $x_1$ and $x_2$ that are equally sensible. For instance, as discussed in
Section 7.4, equations (7.5), (7.6), (7.7) are all equivalent and they are all
sensible. The ridge estimators, however, will differ depending on which of these
forms is used.

There are different situations under which the ridge regression arises
naturally. These will throw light on the matter of the circumstances under which
the method will be useful. We mention three of them.

1. Constrained least squares. Suppose that we estimated the regression
coefficients subject to the condition that

$$\sum_{i=1}^{k} \beta_i^2 = c \tag{7.14}$$

Then we would get something like the ridge regression. The $\lambda$ that we use is
the Lagrangian multiplier in the minimization. To see this, suppose that we
have two explanatory variables.

11 These methods are all reviewed in N. R. Draper and R. Craig Van Nostrand, “Ridge
Regression and James-Stein Estimation: Review and Comments,” Technometrics, Vol. 21, No. 4,
November 1979, pp. 451-466. The authors do not approve of these methods and discuss the
shortcomings of each.

We get the constrained least squares estimator by minimizing

$$\sum (y - \beta_1 x_1 - \beta_2 x_2)^2 + \lambda(\beta_1^2 + \beta_2^2 - c)$$

where $\lambda$ is the Lagrangian multiplier. Differentiating this expression with
respect to $\beta_1$ and $\beta_2$ and equating the derivatives to zero, we get the normal
equations

$$2\sum (y - \beta_1 x_1 - \beta_2 x_2)(-x_1) + 2\lambda\beta_1 = 0$$

$$2\sum (y - \beta_1 x_1 - \beta_2 x_2)(-x_2) + 2\lambda\beta_2 = 0$$

These equations can be written as

$$(S_{11} + \lambda)\beta_1 + S_{12}\beta_2 = S_{1y}$$
$$S_{12}\beta_1 + (S_{22} + \lambda)\beta_2 = S_{2y}$$

where $S_{11} = \sum x_1^2$, $S_{12} = \sum x_1 x_2$, and so on. Thus we get the ridge regression
and $\lambda$ is the Lagrangian multiplier. The value of $\lambda$ is decided by the criterion
$\beta_1^2 + \beta_2^2 = c$. In this case there is a clear-cut procedure for choosing $\lambda$.
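
The "clear-cut procedure" can be sketched numerically: given $c$, search for the $\lambda$ at which the ridge solution satisfies $\beta_1^2 + \beta_2^2 = c$. The example below uses the Section 7.2 moments and a hypothetical value $c = 1.9$, chosen to lie just below the unconstrained least squares value of 2. Bisection is one simple way to do the search and is an implementation choice made here, not a method given in the text.

```python
def ridge(s11, s12, s22, s1y, s2y, lam):
    """Solve the ridge normal equations for two regressors."""
    a11, a22 = s11 + lam, s22 + lam
    det = a11 * a22 - s12**2
    return (s1y * a22 - s12 * s2y) / det, (a11 * s2y - s12 * s1y) / det


def lambda_for_constraint(c, moments, lo=0.0, hi=1e6, tol=1e-10):
    """Bisection for the lam >= 0 at which b1^2 + b2^2 = c.

    Relies on b1^2 + b2^2 decreasing as lam increases, so c must be
    below the value attained by ordinary least squares (lam = 0).
    """
    def gap(lam):
        b1, b2 = ridge(*moments, lam)
        return b1**2 + b2**2 - c

    if gap(lo) < 0:
        raise ValueError("constraint is not binding at lam = 0")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


moments = (200, 150, 113, 350, 263)          # Section 7.2 example
lam = lambda_for_constraint(1.9, moments)    # c = 1.9 is hypothetical
b1, b2 = ridge(*moments, lam)
print(round(lam, 3), round(b1, 4), round(b2, 4), round(b1**2 + b2**2, 6))
```

For this assumed $c$, the implied $\lambda$ falls a little below 5, so the constraint reproduces a ridge adjustment of roughly the size used in the earlier illustration.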

It is rarely the case that we would have prior knowledge about the $\beta_i$ that is
in the form $\sum \beta_i^2 = c$. But some other less concrete information can also be
used to choose the value of $\lambda$ in ridge regression. Brown and Beattie’s¹² ridge
regression on production function data used their prior knowledge on the
relationship between the signs of the $\beta_i$’s.

2. Bayesian interpretation. We have not discussed the Bayesian approach to
statistics in this book. However, roughly speaking, what the approach does is
to combine systematically some prior information on the regression parameters
with sample information. Under this approach we get the ridge regression estimates
of the $\beta_i$’s if we assume that the prior information is of the form that
$\beta_i \sim IN(0, \sigma_\beta^2)$. In this case the ridge constant $\lambda$ is equal to
$\sigma^2/\sigma_\beta^2$. Again, $\sigma^2$ is not known but has to be estimated. However, in
almost all economic problems, this sort of prior information (that the means of the
$\beta_i$’s are zero) is very unreasonable. This suggests that the simple ridge estimator
does not make sense in econometrics (with the Bayesian interpretation). Of course, the
assumption that $\beta_i$ has mean zero can be relaxed. But then we will get more complicated
estimators (generalized ridge estimators).
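A short sketch of why the ridge constant equals $\sigma^2/\sigma_\beta^2$ may help. It assumes, beyond what is stated above, that the errors are normal and that $\sigma^2$ and $\sigma_\beta^2$ are treated as known; the two-variable model is used for concreteness.

```latex
% Log posterior density of (beta_1, beta_2), up to an additive constant,
% with u ~ IN(0, sigma^2) and beta_i ~ IN(0, sigma_beta^2):
\log p(\beta_1, \beta_2 \mid y)
  = -\frac{1}{2\sigma^2} \sum (y - \beta_1 x_1 - \beta_2 x_2)^2
    - \frac{1}{2\sigma_\beta^2} (\beta_1^2 + \beta_2^2) + \text{const}

% Multiplying by -2 sigma^2 shows that maximizing the posterior is the same as minimizing
\sum (y - \beta_1 x_1 - \beta_2 x_2)^2
  + \frac{\sigma^2}{\sigma_\beta^2} (\beta_1^2 + \beta_2^2)

% which has the same normal equations as the constrained problem, with
\lambda = \frac{\sigma^2}{\sigma_\beta^2}
```

A tight prior (small $\sigma_\beta^2$) therefore corresponds to a large $\lambda$, and a diffuse prior to $\lambda$ near zero.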

3. Measurement error interpretation. Consider the two-variable model we
discussed under constrained least squares. Suppose that we add random errors
with zero mean and variance $\lambda$ to both $x_1$ and $x_2$. Since these errors are random,
the covariance between $x_1$ and $x_2$ will be unaffected. The variances of $x_1$ and $x_2$
will both increase by $\lambda$. Thus we get the ridge regression estimator. This interpretation
makes the ridge estimator somewhat suspicious. Smith and
Campbell 13 say that a one-liner summary of this is: “Use less precise data to
get more precise estimates.”

12 W. G. Brown and B. R. Beattie, “Improving Estimates of Economic Parameters by the Use
of Ridge Regression with Production Function Applications,” American Journal of Agricultural
Economics, Vol. 57, 1975, pp. 21-32.
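The moment calculation behind this interpretation can be written out as follows. It is a sketch under assumptions that go slightly beyond the wording above: the added errors $e_1$ and $e_2$ (names introduced here) are taken to be uncorrelated with each other, with $x_1$ and $x_2$, and with $y$.

```latex
% Observed regressors with added errors
x_1^{*} = x_1 + e_1, \qquad x_2^{*} = x_2 + e_2, \qquad
E(e_i) = 0, \quad \operatorname{var}(e_i) = \lambda

% Variances rise by lambda; the covariances are unchanged
\operatorname{var}(x_i^{*}) = \operatorname{var}(x_i) + \lambda, \qquad
\operatorname{cov}(x_1^{*}, x_2^{*}) = \operatorname{cov}(x_1, x_2), \qquad
\operatorname{cov}(x_i^{*}, y) = \operatorname{cov}(x_i, y)
```

So only the diagonal terms of the moment matrix of the regressors change, each by $\lambda$, which is the pattern of the ridge normal equations.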

These are situations in which the ridge regression can be easily justified. In
almost all other cases, there is subjective judgment involved. This subjective
judgment is sometimes equated to “vague prior information.” The Bayesian
methods allow a systematic analysis of the data with “vague prior information”
but a discussion of these methods is beyond the scope of this book.

Because of the deficiencies of ridge regression discussed above, the method
is not recommended as a general solution to the multicollinearity problem. In particular,
the simplest form of the method (where a constant $\lambda$ is added to each
variance) is not very useful. Nevertheless, for the sake of curiosity, we will
present some results on the method. For the consumption function data in Table 7.1
we estimated the regression equation

$$c_t = \beta_0 y_t + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \cdots + \beta_8 y_{t-8} + u_t$$

Needless to say, the $y_t$’s are highly intercorrelated. The results are presented in
Table 7.2. Note that as $\lambda$ increases, there is a smoothing of the coefficients and
the estimate of $\beta_0$ declines. The OLS coefficients, of course, are very erratic.
But the estimates of $\beta_0$ (portion of current income going into current consumption)
are implausibly low with the ridge regression method. The sudden pickup
of coefficients after the fifth quarter is also something very implausible. Maybe
we can just estimate the effects only up to four lags. The OLS estimates are
erratic even with four lags. The computation of the ridge regression estimates
with four lags is left as an exercise.


Table 7.2 Ridge Estimates for Consumption Function Data

                           Value of $\lambda$
Lag       0.0       0.0002     0.0006     0.0010     0.0014     0.0020
0       0.70974    0.42246    0.29302    0.24038    0.21096    0.18489
1       0.20808    0.28187    0.22554    0.19578    0.17773    0.16096
2       0.27463    0.15615    0.14612    0.13865    0.13324    0.12764
3      -0.48068   -0.06079    0.03052    0.05761    0.07060    0.08088
4       0.25129   -0.00301    0.02429    0.04473    0.05736    0.06902
5      -0.23845   -0.06461   -0.00562    0.02304    0.04010    0.05578
6       0.12432    0.01705    0.03600    0.05116    0.06135    0.07138
7      -0.11278    0.06733    0.07964    0.08491    0.08862    0.09254
8       0.19838    0.12632    0.11941    0.11563    0.11367    0.11220
Sum     0.93453    0.94277    0.94892    0.95189    0.95363    0.95529
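One feature of Table 7.2 is easy to check by hand or by machine: the individual lag coefficients move a great deal as $\lambda$ changes, while their sum hardly moves. The sketch below only re-adds the figures printed in the table; the variable names are illustrative.

```python
import numpy as np

# Columns correspond to lambda = 0.0, 0.0002, 0.0006, 0.0010, 0.0014, 0.0020;
# rows correspond to lags 0 through 8 (figures copied from Table 7.2).
lams = [0.0, 0.0002, 0.0006, 0.0010, 0.0014, 0.0020]
coef = np.array([
    [ 0.70974,  0.42246,  0.29302, 0.24038, 0.21096, 0.18489],
    [ 0.20808,  0.28187,  0.22554, 0.19578, 0.17773, 0.16096],
    [ 0.27463,  0.15615,  0.14612, 0.13865, 0.13324, 0.12764],
    [-0.48068, -0.06079,  0.03052, 0.05761, 0.07060, 0.08088],
    [ 0.25129, -0.00301,  0.02429, 0.04473, 0.05736, 0.06902],
    [-0.23845, -0.06461, -0.00562, 0.02304, 0.04010, 0.05578],
    [ 0.12432,  0.01705,  0.03600, 0.05116, 0.06135, 0.07138],
    [-0.11278,  0.06733,  0.07964, 0.08491, 0.08862, 0.09254],
    [ 0.19838,  0.12632,  0.11941, 0.11563, 0.11367, 0.11220],
])
reported_sums = np.array([0.93453, 0.94277, 0.94892, 0.95189, 0.95363, 0.95529])

sums = coef.sum(axis=0)                          # sum of lag coefficients per lambda
spread = coef.max(axis=0) - coef.min(axis=0)     # range of the nine coefficients

for lam, s, sp in zip(lams, sums, spread):
    print(f"lambda={lam:<7} sum={s:.5f}  max-min={sp:.5f}")
```

The sum of the coefficients is the kind of linear function that the data determine fairly well even when the individual coefficients are erratic.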


13 Gary Smith and Frank Campbell, “A Critique of Some Ridge Regression Methods” (with
discussion), Journal of the American Statistical Association, Vol. 75, March 1980, pp. 74-103.

7.6 Principal Component Regression


Another solution that is often suggested for the multicollinearity problem is the
principal component regression, which is as follows. Suppose that we have $k$
explanatory variables. Then we can consider linear functions of these variables:

$$z_1 = a_1 x_1 + a_2 x_2 + \cdots + a_k x_k$$

$$z_2 = b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \quad \text{etc.}$$

Suppose we choose the $a$’s so that the variance of $z_1$ is maximized subject to
the condition that

$$a_1^2 + a_2^2 + \cdots + a_k^2 = 1$$

This is called the normalization condition. (It is required or else the variance
of $z_1$ can be increased indefinitely.) $z_1$ is then said to be the first principal
component. It is the linear function of the $x$’s that has the highest variance (subject
to the normalization rule).

The detailed derivation of the principal components is given in the appendix.
We will discuss the main features and uses of the method which are easy to
understand without the use of matrix algebra. Further, for using the method
there are computer programs available that give the principal components ($z$’s)
given any set of variables $x_1, x_2, \ldots, x_k$.

The process of maximizing the variance of the linear function $z$ subject to
the condition that the sum of squares of the coefficients of the $x$’s is equal to 1
produces $k$ solutions. Corresponding to these we construct $k$ linear functions
$z_1, z_2, \ldots, z_k$. These are called the principal components of the $x$’s. They can
be ordered so that

$$var(z_1) > var(z_2) > \cdots > var(z_k)$$

$z_1$, the one with the highest variance, is called the first principal component, $z_2$,
with the next highest variance, is called the second principal component, and
so on. These principal components have the following properties:

1. $var(z_1) + var(z_2) + \cdots + var(z_k) = var(x_1) + var(x_2) + \cdots + var(x_k)$.

2. Unlike the $x$’s, which are correlated, the $z$’s are orthogonal or uncorrelated.
Thus there is zero multicollinearity among the $z$’s.

Sometimes it is suggested that instead of regressing $y$ on $x_1, x_2, \ldots, x_k$, we
should regress $y$ on $z_1, z_2, \ldots, z_k$. But this is not a solution to the multicollinearity
problem. If we regress $y$ on the $z$’s and then substitute the values of $z$’s
in terms of $x$’s, we finally get the same answers as before. This is similar to the
example we considered in Section 7.4. The fact that the $z$’s are uncorrelated
does not mean that we will get better estimates of the coefficients in the original
regression equation. So there is a point in using the principal components only
if we regress $y$ on a subset of the $z$’s. But there are some problems with this
procedure as well. They are:

1. The first principal component $z_1$, although it has the highest variance,
need not be the one that is most highly correlated with $y$. In fact, there is
no necessary relationship between the order of the principal components
and the degree of correlation with the dependent variable $y$.

2. One can think of choosing only those principal components that have
high correlation with $y$ and discarding the rest, but the same sort of procedure
can be used with the original set of variables $x_1, x_2, \ldots, x_k$ by first
choosing the variable with the highest correlation with $y$, then the one
with the highest partial correlation, and so on. This is what “stepwise
regression programs” do.

3. The linear combinations $z$’s often do not have economic meaning. For
example, what does 2(income) + 3(price) mean? This is one of the most
important drawbacks of the method.

4. Changing the units of measurement of the $x$’s will change the principal
components. This problem can be avoided if all variables are standardized
to have unit variance.

However, there are some uses for the principal component method in exploratory
stages of the investigation. For instance, suppose that there are many
interest rates in the model (since all are measured in the same units, there is no
problem of choice of units of measurement). If the principal component analysis
shows that two principal components account for 99% of the variation in
the interest rates and if by looking at the coefficients, we can identify them as
short-term component and long-term component, we can argue that there are
only two “latent” variables that account for all variations in the interest rates.
Thus the principal component method will give us some guidance as to the
question: “How many independent sources of variation are there?” In addition,
if we can give an economic interpretation to the principal components,
this is useful.

We illustrate the method with reference to a data set from Malinvaud. 14 We
have chosen this data set because it has been used by Chatterjee and Price 15 to
illustrate the principal component method. We will also be using this same data
set in Chapter 11 to illustrate the errors in variables methods.

The data are presented in Table 7.3. First let us estimate an import demand
function. The regression of $y$ on $x_1$, $x_2$, $x_3$ gives the following results.


Variable     Coefficient      SE         t
$x_1$           0.032        0.187      0.17
$x_2$           0.414        0.322      1.29
$x_3$           0.243        0.285      0.85
Constant      -19.73         4.125     -4.78

$n = 18$    $R^2 = 0.973$    $F_{3,14} = 168.4$


14 E. Malinvaud, Statistical Methods of Econometrics, 2nd ed. (Amsterdam: North-Holland,
1970).

15 S. Chatterjee and B. Price, Regression Analysis by Example (New York: Wiley, 1977).

Table 7.3 Imports, Production, Stock Formation, and Consumption in France
(Millions of New Francs at 1959 Prices)

                      Gross Domestic      Stock
         Imports,      Production,      Formation,    Consumption,
Year       $y$            $x_1$           $x_2$          $x_3$
1949       15.9           149.3            4.2           108.1
1950       16.4           161.2            4.1           114.8
1951       19.0           171.5            3.1           123.2
1952       19.1           175.5            3.1           126.9
1953       18.8           180.8            1.1           132.1
1954       20.4           190.7            2.2           137.7
1955       22.7           202.1            2.1           146.0
1956       26.5           212.4            5.6           154.1
1957       28.1           226.1            5.0           162.3
1958       27.6           231.9            5.1           164.3
1959       26.3           239.0            0.7           167.6
1960       31.1           258.0            5.6           176.8
1961       33.3           269.8            3.9           186.6
1962       37.0           288.4            3.1           199.7
1963       43.3           304.5            4.6           213.9
1964       49.0           323.4            7.0           223.8
1965       50.3           336.8            1.2           232.0
1966       56.6           353.9            4.5           242.9

Source: E. Malinvaud, Statistical Methods of Econometrics, 2nd ed. (Amsterdam:
North-Holland, 1970), p. 19.


The $R^2$ is very high and the $F$-ratio is highly significant but the individual
$t$-ratios are all insignificant. This is evidence of the multicollinearity problem.
Chatterjee and Price argue that before any further analysis is made, we should
look at the residuals from this equation. They find (we are omitting the residual
plot here) a distinctive pattern: the residuals decline until 1960 and then rise.
Chatterjee and Price argue that the difficulty with the model is that the
European Common Market began operations in 1960, causing changes in
import-export relationships. Hence they drop the years after 1959 and consider
only the 11 years 1949-59. The regression results now are as follows:


Variable     Coefficient      SE         t
$x_1$          -0.051        0.070     -0.731
$x_2$           0.587        0.095      6.203
$x_3$           0.287        0.102      2.807
Constant      -10.13         1.212     -8.355

$n = 11$    $R^2 = 0.992$

The residual plot (not shown here) is now satisfactory (there are no systematic
patterns), so we can proceed. Even though the $R^2$ is very high, the coefficient
of $x_1$ is not significant. There is thus a multicollinearity problem.

To see what should be done about it, we first look at the simple correlations
among the explanatory variables. These are $r_{12} = 0.026$, $r_{13} = 0.99$,
and $r_{23} = 0.036$. We suspect that the high correlation between $x_1$ and $x_3$
could be the source of the trouble.

Does principal component analysis help us? First, the principal components
(obtained from a principal components program) 16 are:

$$z_1 = 0.7063X_1 + 0.0435X_2 + 0.7065X_3$$

$$z_2 = -0.0357X_1 + 0.9990X_2 - 0.0258X_3$$

$$z_3 = -0.7070X_1 - 0.0070X_2 + 0.7072X_3$$

$X_1, X_2, X_3$ are the normalized values of $x_1, x_2, x_3$. That is, $X_1 = (x_1 - m_1)/\sigma_1$,
$X_2 = (x_2 - m_2)/\sigma_2$, and $X_3 = (x_3 - m_3)/\sigma_3$, where $m_1, m_2, m_3$ are the means and
$\sigma_1, \sigma_2, \sigma_3$ are the standard deviations of $x_1, x_2, x_3$, respectively. Hence

$$var(X_1) = var(X_2) = var(X_3) = 1$$

The variances of the principal components are

$$var(z_1) = 1.999 \qquad var(z_2) = 0.998 \qquad var(z_3) = 0.003$$
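The principal components of standardized variables can be obtained as the eigenvectors of their correlation matrix, with the eigenvalues giving the variances of the $z$’s. The sketch below applies this to the 1949-59 rows of Table 7.3. It is an illustration using NumPy, not the program used by Chatterjee and Price; the signs of the computed coefficient vectors are arbitrary, so a component may come out multiplied by $-1$ relative to the expressions above.

```python
import numpy as np

# Table 7.3, years 1949-59 only.
x1 = np.array([149.3, 161.2, 171.5, 175.5, 180.8, 190.7,
               202.1, 212.4, 226.1, 231.9, 239.0])
x2 = np.array([4.2, 4.1, 3.1, 3.1, 1.1, 2.2, 2.1, 5.6, 5.0, 5.1, 0.7])
x3 = np.array([108.1, 114.8, 123.2, 126.9, 132.1, 137.7,
               146.0, 154.1, 162.3, 164.3, 167.6])

X = np.column_stack([x1, x2, x3])
# Standardize: subtract the mean, divide by the sample standard deviation.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

R = np.corrcoef(X, rowvar=False)          # correlation matrix of x1, x2, x3
eigvals, eigvecs = np.linalg.eigh(R)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]         # reorder so z1 has the largest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xs @ eigvecs                          # columns are z1, z2, z3

print("variances of z1, z2, z3:", np.round(eigvals, 3))
print("coefficients (columns = z1, z2, z3):")
print(np.round(eigvecs, 4))
print("sum of variances:", round(float(eigvals.sum()), 6))
```

The printed variances and coefficients can be compared with the figures quoted in the text.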

Note that $\sum var(z_i) = \sum var(X_i) = 3$. The fact that $var(z_3) \approx 0$ identifies that
linear function as the source of multicollinearity. In this example there is only
one such linear function. In some examples there could be more. Since $E(X_1)
= E(X_2) = E(X_3) = 0$ because of normalization, the $z$’s have mean zero. Thus
$z_3$ has mean zero and its variance is also close to zero. Thus we can say that
$z_3 \approx 0$. Looking at the coefficients of the $X$’s, we can say that (ignoring the
coefficients that are very small)

$$z_1 \approx 0.706(X_1 + X_3)$$

$$z_2 \approx X_2$$

$$z_3 \approx 0.707(X_3 - X_1)$$

$z_3 \approx 0$ gives us $X_1 \approx X_3$.

Actually, we would have gotten the same result from a regression of $X_3$ on $X_1$.
The regression coefficient is $r_{13} = 0.9984$. (Note that $X_1$ and $X_3$ are in
standardized form. Hence the regression coefficient is $r_{13}$.)

In terms of the original (nonnormalized) variables the regression of $x_3$ on $x_1$
is

$$x_3 = 6.258 + \underset{(0.0077)}{0.686}\,x_1 \qquad r^2 = 0.998$$

(The figure in parentheses is the SE.)

16 These are from Chatterjee and Price, Regression Analysis, p. 161. The details of how the
principal components are computed need not concern us here.

In a way we have got no more information from the principal component
analysis than from a study of the simple correlations in this example. Anyway,
what is the solution now? Given that there is an almost exact relationship
between $x_1$ and $x_3$, we cannot hope to estimate the coefficients of $x_1$ and $x_3$
separately. If the original equation is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$

then substituting for $x_3$ in terms of $x_1$ we get

$$y = (\beta_0 + 6.258\beta_3) + (\beta_1 + 0.686\beta_3)x_1 + \beta_2 x_2 + u$$

This gives the linear functions of the $\beta$’s that are estimable. They are
$\beta_0 + 6.258\beta_3$, $\beta_1 + 0.686\beta_3$, and $\beta_2$. The regression of $y$ on $x_1$ and $x_2$
gave the following results.


Variable     Coefficient      SE         t
$x_1$           0.145        0.007     20.67
$x_2$           0.622        0.128      4.87
Constant       -8.440        1.435     -5.88

$R^2 = 0.983$



Of course, we can instead estimate a regression of $x_1$ on $x_3$. The regression
coefficient is 1.451. We now substitute for $x_1$ and estimate a regression of $y$ on $x_2$
and $x_3$. The results we get are slightly better (we get a higher $R^2$). The results
are:


Variable     Coefficient      SE         t
$x_2$           0.596        0.091      6.55
$x_3$           0.212        0.007     29.18
Constant       -9.743        1.059     -9.20

$R^2 = 0.991$

The coefficient of $x_3$ now is $(\beta_3 + 1.451\beta_1)$.

We can get separate estimates of $\beta_1$ and $\beta_3$ only if we have some prior
information. As this example, as well as the example in Section 7.3, indicates, what
multicollinearity implies is that we cannot estimate individual coefficients with
good precision but can estimate some linear functions of the parameters with
good precision. If we want to estimate the individual parameters, we would
need some prior information. We will show that the use of principal components
implies the use of some prior information about the restrictions on the
parameters.

Suppose that we consider regressing $y$ on the principal components $z_1$ and $z_2$
($z_3$ is omitted because it is almost zero). We saw that $z_1 \approx 0.7(X_1 + X_3)$ and
$z_2 \approx X_2$. We have to transform these to the original variables. We get

$$z_1 = \frac{0.7}{\sigma_1}\left(x_1 + \frac{\sigma_1}{\sigma_3}x_3\right) + \text{a constant}$$

$$z_2 = \frac{1}{\sigma_2}(x_2 - m_2)$$

Thus, using $z_2$ as a regressor is equivalent to using $x_2$, and using $z_1$ is equivalent
to using $x_1 + (\sigma_1/\sigma_3)x_3$. Thus the principal component regression amounts to
regressing $y$ on $(x_1 + (\sigma_1/\sigma_3)x_3)$ and $x_2$. In our example $\sigma_1/\sigma_3 = 1.4536$. The
results are:


Variable               Coefficient      SE         t
$x_1 + 1.4536x_3$         0.073        0.003     25.12
$x_2$                     0.609        0.106      5.77
Constant                 -9.129        1.207     -7.56

$R^2 = 0.988$




This is the regression equation we would have estimated if we assumed that
$\beta_3 = (\sigma_1/\sigma_3)\beta_1 = 1.4536\beta_1$. Thus the principal component regression amounts,
in this example, to the use of the prior information $\beta_3 = 1.4536\beta_1$.

If all principal components are used, it is exactly equivalent to using all the
original set of explanatory variables. If some principal components are omitted,
this amounts to using some prior information on the $\beta$’s. In our example
the question is whether the assumption $\beta_3 = 1.45\beta_1$ makes economic sense.
Without having more disaggregated data that break down imports into consumption
and production goods we cannot say anything. Anyway, with 11 observations
we cannot hope to answer many questions. The purpose of our analysis
has been merely to show what principal component regression is and to
show that it implies some prior information.
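The equivalence claimed here can be checked numerically with the 1949-59 data of Table 7.3: regressing $y$ on all three principal components reproduces the fitted values of the regression on $x_1, x_2, x_3$, while dropping $z_3$ imposes a restriction and cannot raise $R^2$. This is a minimal sketch; the helper names `ols_fit` and `r_squared` are illustrative, not from the text.

```python
import numpy as np

# Table 7.3, years 1949-59 only.
y = np.array([15.9, 16.4, 19.0, 19.1, 18.8, 20.4, 22.7, 26.5, 28.1, 27.6, 26.3])
x1 = np.array([149.3, 161.2, 171.5, 175.5, 180.8, 190.7,
               202.1, 212.4, 226.1, 231.9, 239.0])
x2 = np.array([4.2, 4.1, 3.1, 3.1, 1.1, 2.2, 2.1, 5.6, 5.0, 5.1, 0.7])
x3 = np.array([108.1, 114.8, 123.2, 126.9, 132.1, 137.7,
               146.0, 154.1, 162.3, 164.3, 167.6])

def ols_fit(regressors, dep):
    """Least squares with a constant; returns (coefficients, fitted values)."""
    A = np.column_stack([np.ones(len(dep)), regressors])
    coef, *_ = np.linalg.lstsq(A, dep, rcond=None)
    return coef, A @ coef

def r_squared(dep, fitted):
    return 1.0 - np.sum((dep - fitted) ** 2) / np.sum((dep - dep.mean()) ** 2)

X = np.column_stack([x1, x2, x3])
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, vecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
Z = Xs @ vecs[:, ::-1]                  # columns ordered z1, z2, z3

coef_x, fit_x = ols_fit(X, y)           # y on x1, x2, x3
_, fit_z_all = ols_fit(Z, y)            # y on z1, z2, z3
_, fit_z_12 = ols_fit(Z[:, :2], y)      # y on z1, z2 only (z3 dropped)

print("OLS coefficients (const, x1, x2, x3):", np.round(coef_x, 3))
print("R^2, y on x's:      ", round(r_squared(y, fit_x), 4))
print("R^2, y on all z's:  ", round(r_squared(y, fit_z_all), 4))
print("R^2, y on z1 and z2:", round(r_squared(y, fit_z_12), 4))
```

Any gain from principal component regression therefore comes entirely from the restriction implied by the omitted components.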


7.7 Dropping Variables 


The problem with multicollinearity is essentially lack of sufficient information
in the sample to permit accurate estimation of the individual parameters. In
some cases we may not be interested in all the parameters.
In such cases we can get estimators for the parameters we are interested in that
have smaller mean square errors than the OLS estimators, by dropping some
variables.

Consider the model

$$y = \beta_1 x_1 + \beta_2 x_2 + u \qquad (7.15)$$

and the problem is that $x_1$ and $x_2$ are very highly correlated. Suppose that our
main interest is in $\beta_1$. Then we drop $x_2$ and estimate the equation

$$y = \beta_1 x_1 + v \qquad (7.16)$$

Let the estimator of $\beta_1$ from the complete model (7.15) be denoted by $\hat\beta_1$ and
the estimator of $\beta_1$ from the omitted variable model be denoted by $\beta_1^*$. $\hat\beta_1$ is the
OLS estimator and $\beta_1^*$ is the OV (omitted variable) estimator. For the OLS
estimator we know that

$$E(\hat\beta_1) = \beta_1 \quad \text{and} \quad var(\hat\beta_1) = \frac{\sigma^2}{S_{11}(1 - r_{12}^2)}$$

[see formula (7.1)]. For the OV estimator we have to compute
$E(\beta_1^*)$ and $var(\beta_1^*)$. Now


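$$\beta_1^* = \frac{\sum x_1 y}{\sum x_1^2}$$

Substituting for $y$ from (7.15) we get

$$\beta_1^* = \frac{\sum x_1(\beta_1 x_1 + \beta_2 x_2 + u)}{\sum x_1^2} = \beta_1 + \frac{S_{12}}{S_{11}}\beta_2 + \frac{\sum x_1 u}{S_{11}}$$

(Note that we used $S_{11} = \sum x_1^2$ and $S_{12} = \sum x_1 x_2$.) Hence

$$E(\beta_1^*) = \beta_1 + \beta_2\frac{S_{12}}{S_{11}}$$

and

$$var(\beta_1^*) = var\left(\frac{\sum x_1 u}{S_{11}}\right) = \frac{\sigma^2 S_{11}}{S_{11}^2} = \frac{\sigma^2}{S_{11}}$$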

Thus $\beta_1^*$ is biased but has a smaller variance than $\hat\beta_1$. We have

$$\frac{var(\beta_1^*)}{var(\hat\beta_1)} = 1 - r_{12}^2$$

and if $r_{12}$ is very high, then $var(\beta_1^*)$ will be considerably less than $var(\hat\beta_1)$.
Now
Now 


$$\frac{(\text{Bias in } \beta_1^*)^2}{var(\hat\beta_1)} = \left(\frac{\beta_2 S_{12}}{S_{11}}\right)^2 \frac{S_{11}(1 - r_{12}^2)}{\sigma^2} = \frac{S_{12}^2}{S_{11}S_{22}}\,\beta_2^2\,\frac{S_{22}(1 - r_{12}^2)}{\sigma^2}$$

The first term in this expression is $r_{12}^2$. The last term is the reciprocal of
$var(\hat\beta_2)$ [see formula (7.2)]. Thus the whole expression can be written as $r_{12}^2 t_2^2$,
where

$$t_2^2 = \frac{\beta_2^2}{var(\hat\beta_2)}$$

$t_2$ is the “true” $t$-ratio for $x_2$ in equation (7.15), not the “estimated” $t$-ratio.

Noting that mean-square error MSE = (bias)$^2$ + variance, and that for $\hat\beta_1$,
MSE = variance, we get

$$\frac{MSE(\beta_1^*)}{MSE(\hat\beta_1)} = \frac{(\text{bias in } \beta_1^*)^2}{var(\hat\beta_1)} + \frac{var(\beta_1^*)}{var(\hat\beta_1)} = r_{12}^2 t_2^2 + (1 - r_{12}^2) = 1 + r_{12}^2(t_2^2 - 1) \qquad (7.17)$$
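Formula (7.17) can be checked numerically by computing the bias and the two variances directly from their definitions and comparing the resulting MSE ratio with $1 + r_{12}^2(t_2^2 - 1)$. The sketch below does this for hypothetical regressor values and hypothetical values of $\beta_2$ and $\sigma^2$; the function names are illustrative.

```python
import numpy as np

def mse_ratio_direct(x1, x2, beta2, sigma2):
    """MSE(OV)/MSE(OLS) for beta_1, built from bias and variances."""
    S11, S22, S12 = x1 @ x1, x2 @ x2, x1 @ x2
    r2 = S12 ** 2 / (S11 * S22)               # squared correlation r12^2
    var_ols = sigma2 / (S11 * (1.0 - r2))     # var of OLS estimator, formula (7.1)
    var_ov = sigma2 / S11                     # var of OV estimator
    bias = beta2 * S12 / S11                  # bias of OV estimator
    return (bias ** 2 + var_ov) / var_ols

def mse_ratio_formula(x1, x2, beta2, sigma2):
    """The same ratio computed from (7.17): 1 + r12^2 (t2^2 - 1)."""
    S11, S22, S12 = x1 @ x1, x2 @ x2, x1 @ x2
    r2 = S12 ** 2 / (S11 * S22)
    t2_sq = beta2 ** 2 * S22 * (1.0 - r2) / sigma2   # "true" t-ratio squared
    return 1.0 + r2 * (t2_sq - 1.0)

# Hypothetical, highly correlated regressors in deviation form.
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x2 = np.array([-1.9, -1.1, 0.1, 0.9, 2.0])

for beta2 in (0.1, 5.0, 20.0):
    d = mse_ratio_direct(x1, x2, beta2, sigma2=1.0)
    f = mse_ratio_formula(x1, x2, beta2, sigma2=1.0)
    print(f"beta2={beta2}: direct={d:.6f}, formula (7.17)={f:.6f}")
```

With these regressors the ratio is below 1 when $\beta_2$ is small relative to its standard error and above 1 when it is large.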


Thus if $|t_2| < 1$, then $MSE(\beta_1^*) < MSE(\hat\beta_1)$. Since $t_2$ is not known, what is
usually done is to use the estimated $t$-value $\hat t_2$ from equation (7.15). As an estimator
of $\beta_1$, we use the conditional-omitted-variable (COV) estimator, defined as

$$\text{COV estimator of } \beta_1 = \begin{cases} \text{the OLS estimator} & \text{if } |\hat t_2| \ge 1 \\ \text{the OV estimator} & \text{if } |\hat t_2| < 1 \end{cases}$$

Also, instead of using $\hat\beta_1$ or $\beta_1^*$, depending on $\hat t_2$, we can consider a linear
combination of both, namely

$$\lambda\hat\beta_1 + (1 - \lambda)\beta_1^*$$

This is called the weighted (WTD) estimator and it has minimum mean square
error if $\lambda = t_2^2/(1 + t_2^2)$. Again $t_2$ is not known and we have to use its estimated
value $\hat t_2$. This weighted estimator was first suggested by Huntsberger. The COV
estimator was first suggested by Bancroft. Feldstein 17 studied the mean-squared
error of these two estimators for different values of $t_2$ and $\hat t_2$. He argues that:


1. Omitting a collinear nuisance variable on the basis of its sample $t$-statistic
$\hat t_2$ is generally not advisable. OLS is preferable to any COV estimator
unless one has a strong prior notion that $|t_2|$ is < 1.

2. The WTD estimator is generally better than the COV estimator.

3. The WTD estimator is superior to OLS for $|t_2| < 1.25$ and only slightly
inferior for $1.5 < |t_2| \le 3.0$.

4. The inadequacy of collinear data should not be disguised by reporting
results from the omitted variable regressions. Even if a WTD estimator
is used, one should report the OLS estimates and their standard errors to
let readers judge the extent of multicollinearity.
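For concreteness, the COV and WTD rules can be written as two small functions of the OLS estimate, the OV estimate, and the estimated $t$-ratio. This is a sketch of the definitions only. The function names are illustrative, and the treatment of the boundary case $|\hat t_2| = 1$ (assigned to OLS here) is an assumption.

```python
def cov_estimator(b_ols, b_ov, t2_hat):
    """Conditional-omitted-variable estimator of beta_1.

    Uses OLS when the estimated t-ratio of x2 is at least 1 in absolute
    value, and the omitted-variable estimate otherwise.
    """
    return b_ols if abs(t2_hat) >= 1.0 else b_ov

def wtd_estimator(b_ols, b_ov, t2_hat):
    """Weighted estimator lam*b_ols + (1-lam)*b_ov, with lam = t^2/(1+t^2)."""
    lam = t2_hat ** 2 / (1.0 + t2_hat ** 2)
    return lam * b_ols + (1.0 - lam) * b_ov

# Hypothetical estimates, for illustration only.
b_ols, b_ov = 1.0, 2.0
for t2_hat in (0.0, 0.5, 1.0, 3.0):
    print(t2_hat, cov_estimator(b_ols, b_ov, t2_hat), wtd_estimator(b_ols, b_ov, t2_hat))
```

The COV rule jumps from one estimate to the other at $|\hat t_2| = 1$, whereas the WTD rule moves smoothly from the OV estimate toward OLS as $|\hat t_2|$ grows.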


17 M. S. Feldstein, “Multicollinearity and the Mean Square Error of Alternative Estimators,”
Econometrica, Vol. 41, March 1973, pp. 337-345. References to the papers by Bancroft,
Huntsberger, and others can be found in that paper.

What all this discussion shows is that even in the use of the COV or WTD
estimators prior information on $t_2$ is very important. This brings us back to the
same story as our discussion of the ridge regression and principal components
regression, namely, the importance of prior information. The prior information
regarding the omission of nuisance variables pertains to the true $t$-values for
the coefficients of these variables.

Leamer 18 suggests studying the sensitivity of estimates of the coefficients to
different specifications about prior information on the coefficients. Although
his approach is Bayesian and is beyond the scope of this book, one can do a
simple sensitivity analysis in each problem to assess the impact on the estimates
of the coefficients of interest of changes in the assumptions about the
coefficients of the nuisance parameters. Such sensitivity analysis would be
more useful than using one solution like ridge regression, principal component
regression, omitting variables, and so on, each of which implies some particular
prior information in a concealed way. Very often, this may not be the prior
information you would want to consider.


7.8 Miscellaneous Other Solutions 


Several other solutions to the multicollinearity problem can be found in the
literature. All these, however, should be used only if there are
other reasons to use them, not for solving the collinearity problem as such.
We will discuss them briefly.


Using Ratios or First Differences 

We have discussed the method of using ratios in our discussion of heteroskedasticity
(Chapter 5) and first differences in our discussion of autocorrelation
(Chapter 6). Although these procedures might reduce the intercorrelations
among the explanatory variables, they should be used on the basis of the
considerations discussed in those chapters, not as a solution to the collinearity
problem.


Using Extraneous Estimates 

This method was followed in early demand studies. It was found that in time-series
data income and price were both highly correlated. Hence neither the
price nor income elasticity could be estimated with precision. What was done
was to get an estimate of the income elasticity from budget studies (where
prices do not vary much), use this estimate to “correct” the quantity series for
income variation, and then estimate the price elasticity. 19 For example, if the
equation to be estimated is

$$\log Q = \alpha + \beta_1 \log p + \beta_2 \log y + u$$

we first get $\hat\beta_2$ from budget studies and then regress $(\log Q - \hat\beta_2 \log y)$ on $\log p$
to get estimates of $\alpha$ and $\beta_1$. Here $\hat\beta_2$ is known as the “extraneous estimate.”
There are two main problems with this procedure. First, the fact that $\beta_2$ is
estimated should be taken into account in computing the variances of the estimates
of $\alpha$ and $\beta_1$. This is not usually done, but it can be. Second, and this is the more
important problem, the cross-section estimate of $\beta_2$ may be measuring something
entirely different from what the time-series estimate is supposed to measure.
As Meyer and Kuh 20 argue, the “extraneous” estimate can be really extraneous.

18 E. E. Leamer, “Multicollinearity: A Bayesian Interpretation,” Review of Economics and
Statistics, Vol. 55, 1973, pp. 371-380; “Regression Selection Strategies and Revealed Priors,”
Journal of the American Statistical Association, Vol. 73, 1978, pp. 580-587.

Suppose that we want to use an estimate for a parameter from another data
set. What is the best procedure for doing this? Consider the equation

$$y_1 = \beta_1 x_1 + \beta_2 x_2 + u \qquad (7.18)$$

Suppose that because of the high correlation between $x_1$ and $x_2$, we cannot get
good estimates of $\beta_1$ and $\beta_2$. We try to get an estimate of $\beta_1$ from another data
set and another equation

$$y_2 = \beta_1 x_1 + \gamma z + v \qquad (7.19)$$

In this equation $x_1$ and $z$ are not highly correlated and we get a good estimate
of $\beta_1$, say $\hat\beta_1$. Now we substitute this in (7.18) and regress $(y_1 - \hat\beta_1 x_1)$ on $x_2$ to
get an estimate $\hat\beta_2$ of $\beta_2$. This is the procedure we mentioned earlier. The estimate
of $\beta_2$ is a conditional estimate, conditional on $\beta_1 = \hat\beta_1$. Also, we have to
make corrections for the estimated variance of $\hat\beta_2$ because the error in the
equation now is

$$(y_1 - \hat\beta_1 x_1) = \beta_2 x_2 + w$$

where $w = u + (\beta_1 - \hat\beta_1)x_1$ is not the same as $u$. This procedure is advisable
only when the data behind the estimation of (7.19) are not available to us (the
study is done by somebody else).
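The conditional estimation step, and the way an error in the extraneous estimate carries over into $\hat\beta_2$, can be sketched as follows. The data are hypothetical and generated without an error term so that the effect of $(\beta_1 - \hat\beta_1)x_1$ is isolated. The function name and the symbol `b1_ext` for the extraneous estimate are illustrative.

```python
import numpy as np

def conditional_estimate(y1, x1, x2, b1_ext):
    """Regress (y1 - b1_ext * x1) on x2 with no constant, as in (7.18)."""
    adjusted = y1 - b1_ext * x1
    return (x2 @ adjusted) / (x2 @ x2)

# Hypothetical, highly correlated regressors and noise-free y1.
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x2 = np.array([-1.9, -1.1, 0.1, 0.9, 2.0])
beta1, beta2 = 2.0, 3.0
y1 = beta1 * x1 + beta2 * x2

exact = conditional_estimate(y1, x1, x2, b1_ext=2.0)   # extraneous estimate is exact
off = conditional_estimate(y1, x1, x2, b1_ext=2.5)     # extraneous estimate is off by 0.5

S12, S22 = x1 @ x2, x2 @ x2
print("beta2 estimate with exact b1:", exact)
print("beta2 estimate with b1 off by 0.5:", off)
print("implied shift (beta1 - b1_ext) * S12 / S22:", (beta1 - 2.5) * S12 / S22)
```

Because $x_1$ and $x_2$ are highly correlated, $S_{12}/S_{22}$ is close to 1 here, so an error in the extraneous estimate passes almost one for one into the estimate of $\beta_2$.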

On the other hand, if the two sets of data are available to us, there is no
reason to use this conditional estimation procedure. A better procedure would
be to estimate equations (7.18) and (7.19) jointly. This is what was done by
Maddala 21 for the data used by Tobin in his study on demand for food. It is also
possible to test, by using the joint estimation of equations (7.18) and (7.19) and
separate estimation of the equations, whether the coefficient of $x_1$ is the same
in the two equations. 22

19 An example of this is J. Tobin, “A Statistical Demand Function for Food in the U.S.A.,”
Journal of the Royal Statistical Society, Series A, 1950, pp. 113-141.

20 John Meyer and Edwin Kuh, “How Extraneous Are Extraneous Estimates?” Review of
Economics and Statistics, November 1957.

21 G. S. Maddala, “The Likelihood Approach to Pooling Cross-Section and Time-Series Data,”
Econometrica, Vol. 39, November 1971, pp. 939-953.

In summary, as a solution to the multicollinearity problem, it is not advisable
to substitute extraneous parameter estimates in the equation. One can, of
course, pool the different data sets to get more efficient estimates of the parameters,
but one should also perform some tests to see whether the parameters
in the different equations are indeed the same.

Getting More Data 

One solution to the multicollinearity problem that is often suggested is to “go
and get more data.” Actually, the extraneous estimators case we have discussed
also falls in this category (we look for another model with common
parameters and the associated data set). Sometimes using quarterly or monthly
data instead of annual data helps us in getting better estimates. However, we
might be adding more sources of variation like seasonality. In any case, since
weak data and inadequate information are the sources of our problem, getting
more data will help matters.


Summary 


1. In multiple regression analysis it is usually difficult to interpret the estimates
of the individual coefficients if the variables are highly intercorrelated.
This problem is often referred to as the multicollinearity problem.

2. However, high intercorrelations among the explanatory variables by
themselves need not necessarily cause any problems in inference. Whether or
not this is a problem will depend on the magnitude of the error variance and
the variances of the explanatory variables. If there is enough variation in the
explanatory variables and the variance of the error term is sufficiently small,
high intercorrelations among the explanatory variables need not cause a problem.
This is illustrated by using formulas (7.1)-(7.4) in Section 7.2.

3. Measures of multicollinearity based solely on high intercorrelations
among the explanatory variables are useless. These are discussed in Section
7.3. Also, as shown in Section 7.4, these correlations can change with simple
transformations of the explanatory variables. This does not mean that the problem
has been solved.

4. There have been several solutions to the multicollinearity problem. These
are:

(a) Ridge regression, on which an enormous amount of literature exists.

(b) Principal component regression. This amounts to transforming the explanatory
variables to an uncorrelated set, but mere transformation does not
solve the problem. (It appears through the back door.)

(c) Dropping variables.

All of these so-called solutions are really ad hoc procedures. Each implies the
use of some prior information and it is better to examine this before undertaking
a mechanical solution that has been suggested by others.

22 In the case of Tobin’s demand for food example, this test done by Maddala showed that there
were significant differences between the two parameters.

5. The basic problem is lack of enough information to answer the questions 
posed. The only solutions are: 

(a) To get more data. 

(b) To ask what questions are answerable with the data at hand. 

(c) To examine what prior information will be most helpful [in fact, this should 
precede solution (a)]. 


Exercises 


1. Define the term “multicollinearity.” Explain how you would detect its presence
in a multiple regression equation you have estimated. What are the
consequences of multicollinearity, and what are the solutions?

2. Explain the following methods.

(a) Ridge regression.

(b) Omitted-variable regression.

(c) Principal component regression.

What are the problems these methods are supposed to solve?

3. Examine whether the following statements are true or false. Give an
explanation.

(a) In multiple regression, a high correlation in the sample among the
regressors (multicollinearity) implies that the least squares estimators of
the coefficients are biased.

(b) Whether or not multicollinearity is a problem cannot be decided by
just looking at the intercorrelations between the explanatory variables.

(c) If the coefficient estimates in an equation have high standard errors,
this is evidence of high multicollinearity.

(d) The relevant question to ask if there is high multicollinearity is not
what variables to drop but what other information will help.

4. In a study analyzing the determinants of faculty salaries, the results shown 
on p. 296 were obtained. 23 The dependent variable is 1969-1970 academic 
year salary. We have omitted eight more other explanatory variables. 

(a) Do any of the coefficients have unexpected signs? 

(b) Is there a multicollinearity problem? What variables do you expect to 
be highly correlated? 

(c) Teacher rating is the only nonsignificant variable among the variables
presented. Can you explain why? Note that T is a dummy variable
indicating whether or not the professor ranked in the top 50% of all
instructors by a vote of the students.

23 David A. Katz, “Faculty Salaries, Promotions and Productivity at a Large University,” The
American Economic Review, June 1973, pp. 469-477.

(d) Would dropping the variable T from the equation change the signs of 
any of the other variables? 


Explanatory Variable         Coefficient    Standard Error          F
Books, B                          230              86              7.21
Articles, A                        18               8              5.37
Excellent articles, E             102              28             13.43
Dissertations, F                  489              60             66.85
Public Service, P                  89              38              5.65
Committees, C                     156              49             10.02
Experience, Y                     189              17            126.92
Teacher rating, T                  53             370              0.01
English professors, D 4         -2293             529             18.75
Female, X                       -2410             528             20.80
Ph.D. degree, R                  1919             607             10.01

Constant = 11,155; $R^2$ = 0.68; n = 596
Standard error of regression = 2946
Mean of the dependent variable = 15,679
SD of the dependent variable = 5093


(e) The variable F is the number of dissertations supervised since 1964. 
B is the number of books published. How do you explain the high 
coefficient for F relative to that of B? 

(f) Would you conclude from the coefficient of X that there is sex discrimination?

(g) Compute the partial $r^2$ for experience.

5. Estimate demand for food functions on the basis of the data in Table 4.9.
Discuss if there is a multicollinearity problem and what you are going to do
about it.

6. Estimate demand for gasoline on the basis of the data in Table 4.8. Are the
wrong signs for $P_g$ a consequence of multicollinearity?


Appendix to Chapter 7 


Linearly Dependent Explanatory Variables 

In Chapter 4 we assumed that the explanatory variables were linearly independent
and hence that $(X'X)^{-1}$ exists. What happens if the explanatory variables
are linearly dependent? This is the case of perfect multicollinearity. In this case
$(X'X)$ will be a singular matrix (its rank will be less than $k$). Hence we do not
have a unique solution to the normal equations. However, consider two different
solutions, $\hat\beta_1$ and $\hat\beta_2$, to the normal equations. We then have

$$(X'X)\hat\beta_1 = X'y$$

and

$$(X'X)\hat\beta_2 = X'y$$

Premultiply the first equation by $\hat\beta_2'$ and the second equation by $\hat\beta_1'$ and subtract.
Since $\hat\beta_2'X'X\hat\beta_1 = \hat\beta_1'X'X\hat\beta_2$ (the transpose of a scalar is the same), we get the
result that $\hat\beta_1'X'y = \hat\beta_2'X'y$; that is, the regression sum of squares is the same.
Hence the residual sum of squares will be the same whatever solution we take.

If $(X'X)$ is singular, it means that not all the regression parameters $\beta_i$ are
estimable, but only certain linear functions of the $\beta_i$ are estimable. The question
is: What linear functions are estimable?

Let $a$ be a $k \times 1$ vector that is a linear combination of the columns of $(X'X)$.
Thus $a = (X'X)\lambda$. Then the linear function $a'\beta$ is uniquely estimable. To see
this consider any solution $\hat\beta$ of the normal equations. Then

$$a'\hat\beta = \lambda'X'X\hat\beta = \lambda'X'y$$

Thus $a'\hat\beta$ is a unique linear function of $y$. Since $E(\lambda'X'y) = \lambda'X'X\beta = a'\beta$,
we get the result that $a'\hat\beta$ is a unique unbiased linear estimator of $a'\beta$. It can
also be shown that it has minimum variance among all linear unbiased estimators
(the proof is similar to the one in the Appendix to Chapter 4). Thus it
is BLUE.

In case $(X'X)$ is nonsingular, every $k \times 1$ vector can be expressed as a linear
combination of the columns of $(X'X)$. Thus all linear functions $a'\beta$ are uniquely
estimable. Hence all the $\beta_i$ are uniquely estimable. As an illustration of perfect
multicollinearity, consider the example in Section 7.2, where $x_3 = x_1 + x_2$. In
this case we have

$$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u = (\beta_1 + \beta_3)x_1 + (\beta_2 + \beta_3)x_2 + u$$

Thus we can see that only $\beta_1 + \beta_3$ and $\beta_2 + \beta_3$ are estimable. In this case we
have

$$(X'X) = \begin{bmatrix} 5 & 0 & 5 \\ 0 & 5 & 5 \\ 5 & 5 & 10 \end{bmatrix}$$
Take $a = \frac{1}{5}$ (first column). Then $a' = (1, 0, 1)$ and $a'\beta = \beta_1 + \beta_3$. Hence
$\beta_1 + \beta_3$ is estimable. If we take $a = \frac{1}{5}$ (second column), then $a' = (0, 1, 1)$ and
$a'\beta = \beta_2 + \beta_3$. Hence $\beta_2 + \beta_3$ is estimable. Can we estimate $\beta_1 + \beta_2 + \beta_3$?
No, because we cannot get $a' = (1, 1, 1)$ by taking any linear combination of
the columns of $(X'X)$.

Exercise


As yet another example, consider the case

$$X'X = \begin{bmatrix} 5 & 2 & 17 \\ 2 & 3 & 9 \\ 17 & 9 & 60 \end{bmatrix}$$

Show that $\beta_1 + 3\beta_3$ and $\beta_2 + \beta_3$ are estimable. Is $\beta_1 - \beta_2 + 2\beta_3$ estimable?
Is $\beta_1 + \beta_2 + \beta_3$ estimable?
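The estimability condition used in this appendix (a linear function $a'\beta$ is estimable when $a$ is a linear combination of the columns of $X'X$) can be checked numerically by comparing ranks. The following is a minimal sketch, assuming NumPy is available; the function name `is_estimable` and the tolerance are our own choices, not part of the text. It is applied to the $X'X$ matrix of the illustration with $x_3 = x_1 + x_2$.

```python
import numpy as np

def is_estimable(xtx, a, tol=1e-10):
    """Return True if a lies in the column space of X'X.

    Appending a to the columns of X'X leaves the rank unchanged
    exactly when a is a linear combination of those columns.
    """
    xtx = np.asarray(xtx, dtype=float)
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    rank_xtx = np.linalg.matrix_rank(xtx, tol=tol)
    rank_aug = np.linalg.matrix_rank(np.hstack([xtx, a]), tol=tol)
    return bool(rank_aug == rank_xtx)

# X'X from the illustration in the text (x3 = x1 + x2).
XTX = np.array([[5, 0, 5],
                [0, 5, 5],
                [5, 5, 10]])

print(is_estimable(XTX, [1, 0, 1]))  # beta1 + beta3
print(is_estimable(XTX, [0, 1, 1]))  # beta2 + beta3
print(is_estimable(XTX, [1, 1, 1]))  # beta1 + beta2 + beta3
```

The first two calls should report that the function is estimable and the third that it is not, in agreement with the argument given for this matrix.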


The “Condition Number” Measure of Multicollinearity 
(Section 7.3) 

The preceding discussion referred to the case where (X'X) was singular. In 
actual practice the variables are not exactly linearly dependent but almost are. 
That is, (X'X) is close to singularity. The question is: How do we measure 
closeness? For this purpose the condition number has been suggested. It is 
defined as the square root of the ratio of the largest to the smallest eigenvalues 
(or characteristic roots) of the matrix (X'X). To understand this we have to 
define characteristic roots. 
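As a rough numerical illustration of this definition, the sketch below computes the condition number from the eigenvalues of a symmetric matrix. It assumes NumPy; the function name `condition_number` and the 2 x 2 example with correlation 0.99 are hypothetical choices of ours.

```python
import numpy as np

def condition_number(xtx):
    """Square root of (largest eigenvalue / smallest eigenvalue) of X'X.

    Assumes xtx is symmetric with strictly positive eigenvalues; for an
    exactly singular matrix the ratio is not defined.
    """
    eig = np.linalg.eigvalsh(np.asarray(xtx, dtype=float))
    return float(np.sqrt(eig.max() / eig.min()))

# Two standardized regressors with sample correlation 0.99:
# the eigenvalues are 1.99 and 0.01.
R = np.array([[1.0, 0.99],
              [0.99, 1.0]])
print(condition_number(R))          # sqrt(199), roughly 14.1
print(condition_number(np.eye(2)))  # uncorrelated regressors give 1
```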


Characteristic Roots and Vectors 

Let $A$ be an $n \times n$ symmetric matrix. Consider minimizing the quadratic form
$x'Ax$ subject to the condition $x'x = 1$. Introducing the Lagrangian multiplier $\lambda$,
we minimize

$$x'Ax - \lambda(x'x - 1)$$

Differentiating with respect to $x$ and equating the derivatives to zero, we get

$$2Ax - 2\lambda x = 0 \quad \text{or} \quad (A - \lambda I)x = 0$$

In order that this set of equations should have a nonnull solution, we should
have

$$\text{Rank}(A - \lambda I) < n \quad \text{or} \quad |A - \lambda I| = 0$$

The roots of this determinantal equation, which is called the characteristic
equation, are called the characteristic roots of $A$ (alternative terms are latent
roots or eigenvalues). The determinantal equation $|A - \lambda I| = 0$ is an $n$th-degree
equation in $\lambda$ and has $n$ roots. Corresponding to each solution $\lambda_i$ there is a
vector $x_i$ that is a solution of $(A - \lambda_i I)x = 0$. These vectors are called characteristic
vectors (or latent vectors or eigenvectors). For instance, if $A$ is a $3 \times 3$
matrix, we have to solve the equation


$$\begin{vmatrix} a_{11} - \lambda & a_{12} & a_{13} \\ a_{21} & a_{22} - \lambda & a_{23} \\ a_{31} & a_{32} & a_{33} - \lambda \end{vmatrix} = 0$$

which is a cubic in $\lambda$ and has three roots. In the text we showed the calculation
of the characteristic roots and the condition number for a $2 \times 2$ matrix.

As an example, consider the matrix

$$A = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 1 & 1 \\ 3 & -4 & 4 \end{bmatrix}$$

Before we go to the characteristic roots of the $3 \times 3$ matrix, let us consider the
$2 \times 2$ submatrix $\begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix}$. The characteristic equation is:

$$\begin{vmatrix} 4 - \lambda & 2 \\ 2 & 1 - \lambda \end{vmatrix} = 0$$

or $(4 - \lambda)(1 - \lambda) - 4 = 0 \Rightarrow \lambda^2 - 5\lambda = 0 \Rightarrow \lambda(\lambda - 5) = 0$. Thus, the two
roots are $\lambda = 0$ and $\lambda = 5$. Note that the sum of the roots is equal to the sum
of the diagonal elements.

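For $\lambda = 0$, the characteristic vector is obtained by solving the equations:

$$\begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{or} \quad \begin{aligned} 4x_1 + 2x_2 &= 0 \\ 2x_1 + x_2 &= 0 \end{aligned}$$

This gives $x_2 = -2x_1$. Normalizing this by taking $x_1^2 + x_2^2 = 1$ or $x_1^2 + 4x_1^2 = 1$,
we get $x_1 = 1/\sqrt{5}$, $x_2 = -2/\sqrt{5}$. Thus, the characteristic vector is
$(1/\sqrt{5}, -2/\sqrt{5})$. For the root $\lambda = 5$, we have to solve the equations:

$$\begin{bmatrix} 4 - 5 & 2 \\ 2 & 1 - 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad \text{or} \quad \begin{aligned} -x_1 + 2x_2 &= 0 \\ 2x_1 - 4x_2 &= 0 \end{aligned}$$

This gives $x_1 = 2x_2$. Again normalizing using $x_1^2 + x_2^2 = 1$ or $4x_2^2 + x_2^2 = 1$,
we get $x_2 = 1/\sqrt{5}$, $x_1 = 2/\sqrt{5}$. Hence, the characteristic vector is
$(2/\sqrt{5}, 1/\sqrt{5})$. Thus, we have the characteristic roots and vectors as:

$\lambda = 0 \Rightarrow$ vector $(1/\sqrt{5}, -2/\sqrt{5})'$
$\lambda = 5 \Rightarrow$ vector $(2/\sqrt{5}, 1/\sqrt{5})'$

Note that the two vectors are orthogonal. Returning to the $3 \times 3$ matrix we
have the characteristic equation

$$\begin{vmatrix} 4 - \lambda & 2 & 2 \\ 2 & 1 - \lambda & 1 \\ 3 & -4 & 4 - \lambda \end{vmatrix} = 0$$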


This gives $\lambda(\lambda - 3)(\lambda - 6) = 0$. Thus $\lambda = 0$, 3, and 6 are the characteristic
roots. To get the characteristic vectors, we solve the equation $(A - \lambda I)x = 0$.
For $\lambda = 0$ we have

$$\begin{bmatrix} 4 & 2 & 2 \\ 2 & 1 & 1 \\ 3 & -4 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

We get the equations

$$\begin{aligned} 2x_1 + x_2 + x_3 &= 0 \\ 3x_1 - 4x_2 + 4x_3 &= 0 \end{aligned}$$

This gives $x_2 = -\frac{5}{8}x_1$ and $x_3 = -\frac{11}{8}x_1$, or $x_1 = 1$, $x_2 = -\frac{5}{8}$, and $x_3 = -\frac{11}{8}$. We have
to normalize this vector by taking $x_1^2 + x_2^2 + x_3^2 = 1$. This gives the normalized
vector as

$$x' = \left( \frac{8}{\sqrt{210}}, \frac{-5}{\sqrt{210}}, \frac{-11}{\sqrt{210}} \right)$$

For $\lambda = 3$ we solve

$$\begin{bmatrix} 4 - 3 & 2 & 2 \\ 2 & 1 - 3 & 1 \\ 3 & -4 & 4 - 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

We get

$$\begin{aligned} x_1 + 2x_2 + 2x_3 &= 0 \\ 2x_1 - 2x_2 + x_3 &= 0 \\ 3x_1 - 4x_2 + x_3 &= 0 \end{aligned}$$

This gives $x_2 = \frac{1}{2}x_1$ and $x_3 = -x_1$. Taking $x_1 = 1$ and normalizing, we get the
characteristic vector as $x' = (\frac{2}{3}, \frac{1}{3}, -\frac{2}{3})$. For $\lambda = 6$ we proceed similarly and
get

$$x' = \left( \frac{2}{\sqrt{6}}, \frac{1}{\sqrt{6}}, \frac{1}{\sqrt{6}} \right)$$

The example we have considered is that of a singular matrix (the first row is
twice the second row). Hence one of the characteristic roots is zero. The matrix
is also nonsymmetric. In this case the characteristic vectors are not orthogonal
to each other. In the case of a symmetric matrix, they are orthogonal
(as proved later). As an example, consider the symmetric matrix

$$A = \begin{bmatrix} 9 & 3 & 3 \\ 3 & 1 & 1 \\ 3 & 1 & 7 \end{bmatrix}$$

Using the previous procedure we get the characteristic roots as $\lambda = 0$, 5, and
12. The corresponding characteristic vectors are:

For $\lambda = 0$: $x_1' = \left( \frac{1}{\sqrt{10}}, \frac{-3}{\sqrt{10}}, 0 \right)$

For $\lambda = 5$: $x_2' = \left( \frac{3}{\sqrt{35}}, \frac{1}{\sqrt{35}}, \frac{-5}{\sqrt{35}} \right)$

For $\lambda = 12$: $x_3' = \left( \frac{3}{\sqrt{14}}, \frac{1}{\sqrt{14}}, \frac{2}{\sqrt{14}} \right)$

Note that these vectors are orthogonal to each other. That is, $x_1'x_2 = x_1'x_3 =
x_2'x_3 = 0$.
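The hand calculations for the two 3 x 3 examples can be cross-checked numerically. The sketch below assumes NumPy and uses `numpy.linalg.eig` for the nonsymmetric matrix and `numpy.linalg.eigh` for the symmetric one; the variable names are ours. Numerical routines may return the characteristic vectors in a different order or with opposite signs, which does not affect the results.

```python
import numpy as np

# The nonsymmetric and symmetric examples from the text.
A_nonsym = np.array([[4.0, 2.0, 2.0],
                     [2.0, 1.0, 1.0],
                     [3.0, -4.0, 4.0]])
A_sym = np.array([[9.0, 3.0, 3.0],
                  [3.0, 1.0, 1.0],
                  [3.0, 1.0, 7.0]])

# Columns of the second return value are unit-length characteristic vectors.
roots_nonsym, vecs_nonsym = np.linalg.eig(A_nonsym)
roots_sym, vecs_sym = np.linalg.eigh(A_sym)  # roots in ascending order

print(np.sort(roots_nonsym.real))  # close to 0, 3, 6
print(roots_sym)                   # close to 0, 5, 12

# X'X is the identity only when the vectors are mutually orthogonal.
gram_nonsym = vecs_nonsym.real.T @ vecs_nonsym.real
gram_sym = vecs_sym.T @ vecs_sym
print(np.allclose(gram_sym, np.eye(3)))     # symmetric case
print(np.allclose(gram_nonsym, np.eye(3)))  # nonsymmetric case
```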

Properties of Characteristic Roots and Vectors 

Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be the $n$ characteristic roots and $x_1, x_2, \ldots, x_n$ be the
corresponding characteristic vectors of the matrix $A$. We shall state some important
properties of the characteristic roots and vectors.

1. The maximum value of $x'Ax$ is the maximum characteristic root.

Proof: $x_i'Ax_i = \lambda_i x_i'x_i = \lambda_i$. Hence the result follows.

2. If $\lambda_1$ and $\lambda_2$ are two distinct characteristic roots, then $x_1'x_2 = 0$ or the
corresponding characteristic vectors are orthogonal.

Proof: $Ax_1 = \lambda_1 x_1$. Hence $x_2'Ax_1 = \lambda_1 x_2'x_1$. $Ax_2 = \lambda_2 x_2$. Hence $x_1'Ax_2 =
\lambda_2 x_1'x_2$. Since $A$ is symmetric, $x_2'Ax_1 = x_1'Ax_2$, so by subtraction we get
$0 = (\lambda_1 - \lambda_2)x_1'x_2$. But since $\lambda_1 \neq \lambda_2$ we have $x_1'x_2 = 0$.

3. $|A| = \lambda_1\lambda_2 \cdots \lambda_n$ and $\text{Tr}(A) = \lambda_1 + \lambda_2 + \cdots + \lambda_n$.

Proof: Let $X$ be the matrix whose columns are the characteristic vectors
of $A$. That is, $X = [x_1, x_2, \ldots, x_n]$. If the $\lambda_i$ are all distinct, the columns
of $X$ are orthogonal. Also, $x_i'x_i = 1$ for all $i$. Thus $X$ is an orthogonal
matrix. Hence $X^{-1} = X'$ (see the Appendix to Chapter 2) and $X'X = I$.
Now $A(x_1, x_2, \ldots, x_n) = (\lambda_1 x_1, \lambda_2 x_2, \ldots, \lambda_n x_n)$ or $AX = XD$, where $D$
is the diagonal matrix

$$D = \begin{bmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{bmatrix}$$

with all nondiagonal terms zero. Therefore, $X'AX = X'XD = D$. $|X'AX|
= |D|$ or $|X'| \cdot |A| \cdot |X| = \lambda_1\lambda_2 \cdots \lambda_n$. But since $X$ is an orthogonal matrix,
$|X'| \cdot |X| = 1$. Hence we get $|A| = \lambda_1\lambda_2 \cdots \lambda_n$. Also, $\text{Tr}(D) = \lambda_1 + \lambda_2 +
\cdots + \lambda_n$ and $\text{Tr}(X'AX) = \text{Tr}(AXX') = \text{Tr}(A)$ since $XX' = I$. Hence $\text{Tr}(A)
= \lambda_1 + \lambda_2 + \cdots + \lambda_n$.

Note: Although we have proved the results above for distinct roots, the
results are valid for even repeated roots. That is, given a symmetric matrix
$A$, there exists an orthogonal matrix $X$ (whose columns are the characteristic
vectors of $A$) such that $X'AX = D$, where $D$ is a diagonal matrix
whose elements are the characteristic roots of $A$.

4. $\text{Rank}(A) = \text{Rank}(D)$ = the number of nonzero characteristic roots of $A$.

Proof: Since rank is unaltered by pre- or postmultiplication by a nonsingular
matrix, $\text{Rank}(A) = \text{Rank}(X'AX) = \text{Rank}(D)$.

5. The characteristic roots of $A^2$ are the squares of the characteristic roots
of $A$, but the characteristic vectors are the same.

Proof: $X'AX = D$ or $A = XDX'$. $A^2 = (XDX')(XDX') = XD^2X'$ since $X'X
= I$. Thus the characteristic roots are given by the diagonal elements of
$D^2$ (i.e., $\lambda_i^2$) and the characteristic vectors are given by the columns of $X$.

6. If $A$ is a positive definite matrix, all the characteristic roots are positive.

Proof: Consider $Ax_i = \lambda_i x_i$. Then $x_i'Ax_i = \lambda_i x_i'x_i$, and $x_i'Ax_i > 0$ if $A$ is
positive definite. Since $x_i'x_i = 1$, we have $\lambda_i > 0$. By a similar argument we can
show that if $A$ is positive semidefinite, $\lambda_i \geq 0$. If $A$ is negative definite, $\lambda_i < 0$
for all $i$. If $A$ is negative semidefinite, $\lambda_i \leq 0$ for all $i$. Note that for the symmetric
matrix we considered earlier, the roots were 0, 5, and 12. Thus it is positive
semidefinite.
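The properties listed above can be verified numerically for the symmetric example with roots 0, 5, and 12. This is only a sketch, assuming NumPy; the dictionary `checks` and the tolerances are our own illustrative choices.

```python
import numpy as np

A = np.array([[9.0, 3.0, 3.0],
              [3.0, 1.0, 1.0],
              [3.0, 1.0, 7.0]])

# eigh returns the roots in ascending order and orthonormal vectors
# as the columns of X.
roots, X = np.linalg.eigh(A)
D = np.diag(roots)

checks = {
    "X'AX = D": np.allclose(X.T @ A @ X, D),
    "|A| = product of roots": np.isclose(np.linalg.det(A), np.prod(roots), atol=1e-8),
    "Tr(A) = sum of roots": np.isclose(np.trace(A), roots.sum()),
    "Rank(A) = number of nonzero roots":
        np.linalg.matrix_rank(A) == int(np.sum(np.abs(roots) > 1e-8)),
    "roots of A^2 are squared roots of A":
        np.allclose(np.linalg.eigvalsh(A @ A), np.sort(roots ** 2)),
    "A is positive semidefinite": bool(np.all(roots > -1e-8)),
}

for name, ok in checks.items():
    print(name, bool(ok))
```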

The Case of a Nonsymmetric Matrix 

The preceding results are for symmetric matrices. In econometrics we encounter
nonsymmetric matrices as well (the case of VAR models in Chapter 14).
For these matrices the characteristic vectors are not orthogonal, as we saw
earlier. However, some of the other results are still valid. For example, the
result that the sum of the characteristic roots of $A$ equals $\text{Tr}(A)$ is valid.

Consider the equations $Ax_1 = \lambda_1 x_1$, $Ax_2 = \lambda_2 x_2$, and so on, which we solve
to get the characteristic vectors $x_1, x_2, \ldots$. We can write these as

$$A(x_1, x_2, \ldots, x_n) = (\lambda_1 x_1, \lambda_2 x_2, \ldots, \lambda_n x_n) \Rightarrow AX = XD$$

where $X$ is the matrix whose columns are the characteristic vectors and $D$ is a
diagonal matrix with $\lambda_1, \lambda_2, \ldots, \lambda_n$ as the diagonal elements. Premultiplying
both sides by $X^{-1}$, we get

$$X^{-1}AX = D$$

(We have assumed that $X$ is nonsingular. This can be proved, for example when
the characteristic roots are distinct, but we omit the proof here.) Thus given
such a square matrix $A$, we can find a nonsingular matrix $X$
such that $X^{-1}AX$ is a diagonal matrix with the characteristic roots of $A$ as the
diagonal elements. The columns of $X$ are the corresponding characteristic vectors.
Also, $\text{Tr}(X^{-1}AX) = \text{Tr}(D) = \sum_{i=1}^{n} \lambda_i$. But $\text{Tr}(X^{-1}AX) = \text{Tr}(AXX^{-1}) = \text{Tr}(A)$.
Hence $\text{Tr}(A) = \sum \lambda_i$. This can be checked with the two examples considered
earlier. In the case of the nonsymmetric matrix, $\sum \lambda_i = 9$ and $\text{Tr}(A) = 9$. In the
case of the symmetric matrix, $\sum \lambda_i = 17$ and $\text{Tr}(A) = 17$.
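For the nonsymmetric example the diagonalization uses the inverse of the matrix of characteristic vectors rather than its transpose. A short numerical sketch, assuming NumPy (variable names are ours):

```python
import numpy as np

A = np.array([[4.0, 2.0, 2.0],
              [2.0, 1.0, 1.0],
              [3.0, -4.0, 4.0]])

roots, X = np.linalg.eig(A)       # columns of X are characteristic vectors
D = np.linalg.inv(X) @ A @ X      # should be diagonal with the roots

print(np.allclose(D, np.diag(roots), atol=1e-8))
print(np.trace(A), roots.sum().real)  # both close to 9
```

The roots here are distinct (0, 3, and 6), so the matrix of characteristic vectors is nonsingular and the inverse exists.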

Principal Components 

Consider a set of variables $x_1, x_2, \ldots, x_n$ with covariance matrix $V$. We want
to find a linear function $\alpha'x$ that has maximum variance subject to $\alpha'\alpha = 1$.
The problem is similar to the one we considered earlier. We have to solve
$|V - \lambda I| = 0$. The maximum characteristic root of $V$ is the required maximum
value and the corresponding characteristic vector is the required $\alpha$.

Let us order the characteristic roots $\lambda_1, \lambda_2, \ldots, \lambda_n$ in decreasing order; let
the corresponding vectors be $\alpha_1, \alpha_2, \ldots, \alpha_n$. Consider the linear functions $z_1
= \alpha_1'x$, $z_2 = \alpha_2'x$, $\ldots$, $z_n = \alpha_n'x$. Then $V(z_1) = \alpha_1'V\alpha_1 = \lambda_1$, $V(z_2) = \lambda_2$, $\ldots$.
The $z$'s are called the principal components of the $x$'s.
They have the following properties:

1. $\text{var}(z_1) + \text{var}(z_2) + \cdots + \text{var}(z_n) = \lambda_1 + \lambda_2 + \cdots + \lambda_n = \text{Tr}(V) =
\text{var}(x_1) + \text{var}(x_2) + \cdots + \text{var}(x_n)$.

2. Since $\alpha_1, \alpha_2, \ldots, \alpha_n$ are orthogonal vectors, $z_1, z_2, \ldots, z_n$ are orthogonal
or uncorrelated. The drawbacks of principal component analysis have
been discussed in the text.
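The two properties can be illustrated with simulated data. The sketch below assumes NumPy; the data-generating matrix, the seed, and the variable names are hypothetical and serve only to produce correlated variables.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 observations on three correlated variables (hypothetical data).
mix = np.array([[1.0, 0.5, 0.2],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
data = rng.normal(size=(200, 3)) @ mix

V = np.cov(data, rowvar=False)            # sample covariance matrix
roots, vecs = np.linalg.eigh(V)           # ascending order
order = np.argsort(roots)[::-1]           # reorder to decreasing roots
roots, alphas = roots[order], vecs[:, order]

# Principal components z_i = alpha_i' x, computed on centered data.
Z = (data - data.mean(axis=0)) @ alphas
cov_z = np.cov(Z, rowvar=False)

print(np.allclose(np.diag(cov_z), roots))        # var(z_i) equals the ith root
print(np.isclose(np.trace(cov_z), np.trace(V)))  # total variance is preserved
print(np.allclose(cov_z, np.diag(roots)))        # the z's are uncorrelated
```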

Ridge Regression 

If $(X'X)$ is close to singularity, the problem can be solved by adding positive
elements to the diagonals. The simple ridge estimator is

$$\hat\beta_R = (X'X + \lambda I)^{-1}X'y$$

There are several interpretations of this estimator (discussed in the text). One
is to obtain the least squares estimator when $\sum \beta_i^2 = c$. Introducing the Lagrangian
multiplier $\lambda$, we minimize

$$(y - X\beta)'(y - X\beta) + \lambda(\beta'\beta - c)$$

Differentiating with respect to $\beta$, we get

$$-2X'y + 2X'X\beta + 2\lambda\beta = 0 \quad \text{or} \quad (X'X + \lambda I)\beta = X'y$$

This gives the ridge estimator. $\lambda$ is the Lagrangian multiplier or the “shadow
price” of the constraint.
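A minimal numerical sketch of the simple ridge estimator follows, assuming NumPy. The function name `ridge`, the nearly collinear data, and the value $\lambda = 1$ are hypothetical; the choice of $\lambda$ in practice is discussed in the text.

```python
import numpy as np

def ridge(X, y, lam):
    """Solve (X'X + lam*I) b = X'y; lam = 0 gives ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Two nearly collinear regressors (hypothetical data), y = x1 + x2 exactly.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = x1 + np.array([0.01, -0.01, 0.0, 0.01, -0.01])
X = np.column_stack([x1, x2])
y = x1 + x2

b_ols = ridge(X, y, 0.0)    # close to (1, 1)
b_ridge = ridge(X, y, 1.0)  # shrunk toward zero

print(b_ols, b_ridge)
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))
```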
8 DUMMY VARIABLES AND TRUNCATED VARIABLES


Let us remember the unfortunate econometrician who, in one of the major 
functions of his system, had to use a proxy for risk and a dummy for sex. 

Fritz Machlup 

Journal of Political Economy
July/August 1974


8.1 Introduction 


In the preceding chapters we discussed the estimation of multiple regression
equations and several associated problems such as tests of significance, $R^2$,
heteroskedasticity, autocorrelation, and multicollinearity. In this chapter we
discuss some special kinds of variables occurring in multiple regression equations
and the problems caused by them. The variables we will be considering
are:

1. Dummy variables. 

2. Truncated variables. 

We start with dummy explanatory variables and then discuss dummy dependent
variables and truncated variables. Proxy variables referred to in Machlup’s
quotation are discussed in Chapter 11.

Dummy explanatory variables can be used for several purposes. They can 
be used to 

1. Allow for differences in intercept terms. 

2. Allow for differences in slopes. 

3. Estimate equations with cross-equation restrictions. 

4. Test for stability of regression coefficients.

We discuss each of these uses in turn. 


8.2 Dummy Variables for Changes 
in the Intercept Term 

Sometimes there will be some explanatory variables in our regression equation 
that are only qualitative (e.g., presence or absence of college education and 
racial, sex, or age differences). In such cases one often takes account of these 
effects by a dummy variable. The implicit assumption is that the regression 
lines for the different groups differ only in the intercept term but have the same 
slope coefficients. For example, suppose that the relationship between income 
y and years of schooling x for two groups is as shown in Figure 8.1. The dots 
are for group 1 and circles for group 2.

Note that the slopes of the regression lines for both groups are roughly the
same but the intercepts are different. Hence the regression equations we fit will
be

$$y = \begin{cases} \alpha_1 + \beta x + u & \text{for the first group} \\ \alpha_2 + \beta x + u & \text{for the second group} \end{cases} \qquad (8.1)$$

These equations can be combined into a single equation,

$$y = \alpha_1 + (\alpha_2 - \alpha_1)D + \beta x + u \qquad (8.2)$$

where

$$D = \begin{cases} 1 & \text{for group 2} \\ 0 & \text{for group 1} \end{cases}$$

The variable $D$ is the dummy variable. The coefficient of the dummy variable
measures the differences in the two intercept terms.

If there are more groups, we have to introduce more dummies. For three
groups we have

$$y = \begin{cases} \alpha_1 + \beta x + u & \text{for group 1} \\ \alpha_2 + \beta x + u & \text{for group 2} \\ \alpha_3 + \beta x + u & \text{for group 3} \end{cases}$$

These can be written as

$$y = \alpha_1 + (\alpha_2 - \alpha_1)D_1 + (\alpha_3 - \alpha_1)D_2 + \beta x + u \qquad (8.3)$$

where

$$D_1 = \begin{cases} 1 & \text{for group 2} \\ 0 & \text{for groups 1 and 3} \end{cases} \qquad D_2 = \begin{cases} 1 & \text{for group 3} \\ 0 & \text{for groups 1 and 2} \end{cases}$$

It can be easily checked that by substituting the values for $D_1$ and $D_2$ in (8.3),
we get the intercepts $\alpha_1$, $\alpha_2$, $\alpha_3$, respectively, for the three groups. Note that in
combining the three equations, we are assuming that the slope coefficient $\beta$ is
the same for all groups and that the error term $u$ has the same distribution for
the three groups.

If there is a constant term in the regression equation, the number of dummies
defined should always be one less than the number of groupings by that category
because the constant term is the intercept for the base group and the
coefficients of the dummy variables measure differences in intercepts, as can
be seen from equation (8.3). In that equation the constant term measures the
intercept for the first group, the constant term plus the coefficient of $D_1$ measures
the intercept for the second group, and the constant term plus the coefficient
of $D_2$ measures the intercept for the third group. We have chosen group
1 as the base group, but any one group may be chosen. The coefficients of the
dummy variables measure the differences in the intercepts from that of the base
group. If we do not introduce a constant term in the regression equation, we
can define a dummy variable for each group, and in this case the coefficients
of the dummy variables measure the intercepts for the respective groups. If we
include both the constant term and three dummies, we will be introducing perfect
multicollinearity and the regression program will not run (or will omit one
of the dummies automatically).
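The interpretation of the dummy coefficients in equation (8.3), and the perfect multicollinearity that arises when a constant and a dummy for every group are both included, can be seen in a small numerical sketch. It assumes NumPy; the intercepts 1, 3, and 6, the slope 0.5, and the noise-free data are hypothetical values chosen so that the estimates can be read off exactly.

```python
import numpy as np

alphas = np.array([1.0, 3.0, 6.0])   # hypothetical intercepts for groups 1, 2, 3
beta = 0.5                           # common slope
group = np.repeat([0, 1, 2], 4)      # group index for each observation
x = np.tile([1.0, 2.0, 3.0, 4.0], 3)
y = alphas[group] + beta * x         # noise-free, for illustration only

# Group 1 is the base group: constant, D1 (group 2), D2 (group 3), x.
D1 = (group == 1).astype(float)
D2 = (group == 2).astype(float)
X = np.column_stack([np.ones_like(x), D1, D2, x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # [alpha1, alpha2 - alpha1, alpha3 - alpha1, beta]

# Adding a dummy for the base group as well makes the columns
# linearly dependent: five columns but rank four.
D0 = (group == 0).astype(float)
X_trap = np.column_stack([np.ones_like(x), D0, D1, D2, x])
rank_trap = np.linalg.matrix_rank(X_trap)
print(X_trap.shape[1], rank_trap)
```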

As yet another example, suppose that we have data on consumption C and 
income Y for a number of households. In addition, we have data on: 

1. S: the sex of the head of the household. 

2. A: age of the head of the household, which is given in three categories:
< 25 years, 25 to 50 years, and > 50 years.

3. E: education of the head of household, also in three categories: < high
school, ≥ high school but < college degree, and ≥ college degree.

We include these qualitative variables in the form of dummy variables:

$$D_1 = \begin{cases} 1 & \text{if sex is male} \\ 0 & \text{if female} \end{cases}$$

$$D_2 = \begin{cases} 1 & \text{if age < 25 years} \\ 0 & \text{otherwise} \end{cases} \qquad D_3 = \begin{cases} 1 & \text{if age between 25 and 50 years} \\ 0 & \text{otherwise} \end{cases}$$

$$D_4 = \begin{cases} 1 & \text{if < high school degree} \\ 0 & \text{otherwise} \end{cases} \qquad D_5 = \begin{cases} 1 & \text{if ≥ high school degree but < college degree} \\ 0 & \text{otherwise} \end{cases}$$

For each category the number of dummy variables is one less than the number
of classifications.

Then we run the regression equation

$$C = \alpha + \beta Y + \gamma_1 D_1 + \gamma_2 D_2 + \gamma_3 D_3 + \gamma_4 D_4 + \gamma_5 D_5 + u$$

The assumption made in the dummy-variable method is that it is only the intercept
that changes for each group but not the slope coefficients (i.e., coefficients
of $Y$).

The intercept term for each individual is obtained by substituting the appropriate
values for $D_1$ through $D_5$. For instance, for a male, age < 25, with college
degree, we have $D_1 = 1$, $D_2 = 1$, $D_3 = 0$, $D_4 = 0$, $D_5 = 0$ and hence the
intercept is $\alpha + \gamma_1 + \gamma_2$. For a female, age > 50 years, with a college degree,
we have $D_1 = 0$, $D_2 = 0$, $D_3 = 0$, $D_4 = 0$, $D_5 = 0$ and hence the intercept term
is just $\alpha$.
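The substitution just described can be sketched in code. The function names, the category labels, and the numerical values of $\alpha$ and the $\gamma$'s below are hypothetical and are used only to show how the intercept for a given household is assembled.

```python
def household_dummies(sex, age, education):
    """Return (D1, ..., D5) for one household head.

    education is one of the hypothetical labels "below_high_school",
    "high_school" (high school but no college degree), or "college".
    """
    d1 = 1 if sex == "male" else 0
    d2 = 1 if age < 25 else 0
    d3 = 1 if 25 <= age <= 50 else 0
    d4 = 1 if education == "below_high_school" else 0
    d5 = 1 if education == "high_school" else 0
    return (d1, d2, d3, d4, d5)

def intercept(alpha, gammas, dummies):
    """Intercept alpha + sum of gamma_i * D_i for one household."""
    return alpha + sum(g * d for g, d in zip(gammas, dummies))

alpha = 10.0                          # hypothetical constant term
gammas = (1.0, 2.0, 3.0, 4.0, 5.0)    # hypothetical gamma_1, ..., gamma_5

young_male = household_dummies("male", 22, "college")
older_female = household_dummies("female", 60, "college")
print(young_male, intercept(alpha, gammas, young_male))      # alpha + gamma1 + gamma2
print(older_female, intercept(alpha, gammas, older_female))  # alpha only
```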

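The dummy-variable method is also used if one has to take care of seasonal
factors. For example, if we have quarterly data on $C$ and $Y$, we fit the regression
equation

$$C = \alpha + \beta Y + \lambda_1 D_1 + \lambda_2 D_2 + \lambda_3 D_3 + u$$

where $D_1$, $D_2$, and $D_3$ are seasonal dummies defined by

$$D_1 = \begin{cases} 1 & \text{for the first quarter} \\ 0 & \text{for others} \end{cases} \qquad D_2 = \begin{cases} 1 & \text{for the second quarter} \\ 0 & \text{for others} \end{cases} \qquad D_3 = \begin{cases} 1 & \text{for the third quarter} \\ 0 & \text{for others} \end{cases}$$

If we have monthly data, we use 11 seasonal dummies:

$$D_1 = \begin{cases} 1 & \text{for January} \\ 0 & \text{for others} \end{cases} \qquad D_2 = \begin{cases} 1 & \text{for February} \\ 0 & \text{for others} \end{cases} \qquad \text{etc.}$$

If we feel that, say, December (because of Christmas shopping) is the only
month with strong seasonal effect, we use only one dummy variable:

$$D = \begin{cases} 1 & \text{for December} \\ 0 & \text{for other months} \end{cases}$$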
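Constructing the quarterly dummies from a quarter index is mechanical. A minimal sketch, assuming NumPy and three hypothetical years of quarterly observations, with the fourth quarter as the base:

```python
import numpy as np

quarter = np.tile([1, 2, 3, 4], 3)   # three years of quarterly data

# One column per dummy: D1, D2, D3 (the fourth quarter is the base).
D = np.column_stack([(quarter == q).astype(float) for q in (1, 2, 3)])

# With a constant term the design matrix has full column rank.
design = np.column_stack([np.ones(len(quarter)), D])
print(D[:4])
print(np.linalg.matrix_rank(design))
```

The same construction with the month index and eleven columns gives the monthly dummies.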
Illustrative Example

The Environmental Protection Agency (EPA) publishes auto mileage estimates 
that are designed to help car buyers compare the relative fuel efficiency of 
different models. Does the EPA estimate provide all the information necessary 
for comparing the relative fuel efficiency of the different models? To investigate 
this problem Lovell (see footnote 1) estimated the following regressions.

$$y = \underset{(1.735)}{7.952} + \underset{(0.061)}{0.693}\,\text{EPA} \qquad R^2 = 0.74$$

(Figures in parentheses are standard errors.)

$$y = \underset{(5.349)}{22.008} - \underset{(0.001)}{0.002}\,W - \underset{(0.708)}{2.760}\,S/A + \underset{(1.413)}{3.280}\,G/D + \underset{(0.097)}{0.415}\,\text{EPA} \qquad R^2 = 0.82$$

where y   = miles per gallon as reported by Consumer Union based on road tests
      W   = weight of the vehicle (pounds)
      S/A = dummy variable equal to 0 for standard transmission
            and 1.0 for automatic transmission
      G/D = dummy variable equal to 0 for gas and 1.0 for diesel power
      EPA = mileage estimate by the EPA

The variables $W$, $S/A$, $G/D$ all have correct signs and are significant, showing
that the EPA did not use all the information available in giving its estimates on
fuel efficiency.


Two More Illustrative Examples 

We will discuss two more examples using dummy variables. They are meant to 
illustrate two points worth noting, which are as follows: 

1. In some studies with a large number of dummy variables it becomes 
somewhat difficult to interpret the signs of the coefficients because they 
seem to have the wrong signs. The first example illustrates this problem. 

2. Sometimes the introduction of dummy variables produces a drastic 
change in the slope coefficient. The second example illustrates this point. 

The examples are rather old and outdated but they establish the points we 
wish to make. 

The first example is a study of the determinants of automobile prices.
Griliches (see footnote 2) regressed the logarithm of new passenger car prices on
various specifications. The results are shown in Table 8.1. Since the dependent
variable is the logarithm of price, the regression coefficients can be interpreted
as the estimated percentage change in the price for a unit change in a particular
quality, holding other qualities constant. For example, the coefficient of $H$ indicates
that an increase in 10 units of horsepower, ceteris paribus, results in a 1.2%
increase in price. However, some of the coefficients have to be interpreted with
caution. For example, the coefficient of $P$ in the equation for 1960 says that the
presence of power steering as “standard equipment” led to a 22.5% higher
price in 1960. In this case the variable $P$ is obviously not measuring the effect
of power steering alone but is measuring the effect of “luxuriousness” of the
car. It is also picking up the effects of $A$ and $B$. This explains why the coefficient
of $A$ is so low in 1960. In fact, $A$, $P$, and $B$ together can perhaps be
replaced by a single dummy that measures “luxuriousness.” These variables
appear to be highly intercorrelated. Another coefficient, at first sight puzzling,
is the coefficient of $V$, which, though not significant, is consistently negative.
Though a V-8 costs more than a six-cylinder engine on a “comparable” car,
what this coefficient says is that, holding horsepower and other variables constant,
a V-8 is cheaper by about 4%. Since the V-8’s have higher horsepower,
what this coefficient is saying is that higher horsepower can be achieved more
cheaply if one shifts to V-8 than by using the six-cylinder engine. It measures
the decline in price per horsepower as one shifts to V-8’s even though the total
expenditure on horsepower goes up. This example illustrates the use of dummy
variables and the interpretation of seemingly wrong coefficients.

Table 8.1 Determinants of Prices of Automobiles (a)

Coefficient of (b)        1960               1959               1957
H                     0.119 (0.029)      0.118 (0.029)      0.117 (0.030)
W                     0.136 (0.046)      0.238 (0.034)      0.135 (0.010)
L                     0.015 (0.017)     -0.016 (0.015)      0.039 (0.013)
V                    -0.039 (0.025)     -0.070 (0.039)     -0.025 (0.023)
T                     0.058 (0.016)      0.027 (0.019)      0.028 (0.012)
A                     0.003 (0.040)      0.063 (0.038)      0.114 (0.025)
P                     0.225 (0.037)      0.188 (0.041)      0.078 (0.030)
B                                                           0.159 (0.026)
C                     0.048 (0.039)
R^2                   0.951              0.934              0.966

(a) Figures in parentheses are standard errors.
(b) H, advertised brake horsepower (hundreds); W, shipping
weight (thousands of pounds); L, overall length (tens of
inches); V, = 1 if the car has a V-8 engine, = 0 if it has a
six-cylinder engine; T, = 1 if the car is hard top, = 0 if not;
A, = 1 if automatic transmission is “standard” (i.e.,
included in price), = 0 if not; P, = 1 if power steering is
“standard,” = 0 if not; B, = 1 if power brakes are
“standard,” = 0 if not; C, = 1 if car is designated as a
“compact,” = 0 if not.

1 M. C. Lovell, “Tests of the Rational Expectations Hypothesis,” The American Economic
Review, March 1986, p. 120.

2 Z. Griliches, “Hedonic Price Indexes for Automobiles: An Econometric Analysis of Quality
Change,” Government Price Statistics, Hearings, U.S. Congress, Joint Economic Committee
(Washington, D.C.: U.S. Government Printing Office, 1961). Further results on this problem
can be found in M. Ohta and Z. Griliches, “Automobile Prices Revisited: Extensions of the
Hedonic Hypothesis,” in N. Terleckyj (ed.), Household Behavior and Consumption (New
York: National Bureau of Economic Research, 1975).
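Reading a coefficient in a log-price equation directly as a percentage change is an approximation that is close for small coefficients. For a dummy variable with coefficient $b$, the exact proportional difference implied by the fitted equation is $e^b - 1$. The sketch below makes the comparison for two of the 1960 coefficients in Table 8.1; the function name `pct_effect` is ours, and the calculation is an aside rather than part of the original study.

```python
import math

def pct_effect(b):
    """Exact percentage difference implied by a coefficient b on a
    dummy variable when the dependent variable is a logarithm."""
    return 100.0 * (math.exp(b) - 1.0)

# 1960 coefficients from Table 8.1.
print(pct_effect(0.225))   # power steering, P: about 25.2 rather than 22.5
print(pct_effect(-0.039))  # V-8 engine, V: about -3.8 rather than -3.9
```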

As another example consider the estimates of liquid-asset demand by manufacturing
corporations. Vogel and Maddala (see footnote 3) computed regressions of the form
$\log C = \alpha + \beta \log S$, where $C$ is the cash and $S$ the sales, on the basis of data
from the Internal Revenue Service, “Statistics of Income,” for the year 1960-1961.
The data consisted of 16 industry subgroups and 14 size classes, size
being measured by total assets. When the regression equations were estimated
separately for each industry, the estimates of $\beta$ ranged from 0.929 to 1.077. The
$R^2$’s were uniformly high, ranging from 0.985 to 0.998. Thus one might conclude
that the sales elasticity of demand for cash is close to 1. Also, when the data
were pooled and a single equation estimated for the entire set of 224 observations,
the estimate of $\beta$ was 0.992 and $R^2 = 0.987$. When industry dummies
were added, the estimate of $\beta$ was 0.995 and $R^2 = 0.992$. From the high $R^2$’s
and relatively constant estimate of $\beta$ one might be reassured that the sales elasticity
is very close to 1. However, when asset-size dummies were introduced,
the estimate of $\beta$ fell to 0.334 with $R^2$ of 0.996. Also, all asset-size dummies
were highly significant. The situation is described in Figure 8.2. That the sales
elasticity is significantly less than 1 is also confirmed by other evidence. This
example illustrates how one can be very easily misled by high $R^2$’s and apparent
constancy of the coefficients.
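The mechanism behind this result can be reproduced with artificial data. In the sketch below (assuming NumPy) the size classes, their centers, and the within-class deviations are hypothetical; only the within-class slope of 0.334 is borrowed from the text. The class means lie on a line with slope 1, so the pooled regression gives a slope close to 1 while the regression with size dummies recovers the within-class slope.

```python
import numpy as np

centers = np.array([0.0, 2.0, 4.0, 6.0])  # hypothetical class centers of log S
dev = np.array([-0.2, 0.0, 0.2])          # deviations of log S within a class
size = np.repeat(np.arange(4), 3)         # size-class index per observation
within_slope = 0.334

log_s = centers[size] + np.tile(dev, 4)
# Class means of log C equal those of log S (between-class slope of 1);
# within a class log C moves with slope 0.334.
log_c = centers[size] + within_slope * np.tile(dev, 4)

def ols(X, y):
    """Least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(len(log_s))
pooled_slope = ols(np.column_stack([ones, log_s]), log_c)[1]

size_dummies = np.column_stack([(size == g).astype(float) for g in (1, 2, 3)])
dummy_slope = ols(np.column_stack([ones, size_dummies, log_s]), log_c)[-1]

print(pooled_slope)  # close to 1
print(dummy_slope)   # 0.334
```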


8.3 Dummy Variables for Changes 
in Slope Coefficients 


Figure 8.2. Bias due to omission of dummy variables.

3 R. C. Vogel and G. S. Maddala, “Cross-Section Estimates of Liquid Asset Demand by Manufacturing
Corporations,” The Journal of Finance, December 1967.

In Section 8.2 we considered dummy variables to allow for differences in the
intercept term. These dummy variables assume the values zero or 1. Not all
dummy variables are of this form. We can use dummy variables to allow for
differences in slope coefficients as well. For example, if the regression equations
are:

$$y_1 = \alpha_1 + \beta_1 x_1 + u_1 \quad \text{for the first group}$$

and

$$y_2 = \alpha_2 + \beta_2 x_2 + u_2 \quad \text{for the second group}$$

we can write these equations together as

$$\begin{aligned} y_1 &= \alpha_1 + (\alpha_2 - \alpha_1) \cdot 0 + \beta_1 x_1 + (\beta_2 - \beta_1) \cdot 0 + u_1 \\ y_2 &= \alpha_1 + (\alpha_2 - \alpha_1) \cdot 1 + \beta_1 x_2 + (\beta_2 - \beta_1) \cdot x_2 + u_2 \end{aligned}$$

or

$$y = \alpha_1 + (\alpha_2 - \alpha_1)D_1 + \beta_1 x + (\beta_2 - \beta_1)D_2 + u \qquad (8.4)$$

where

$$D_1 = \begin{cases} 0 & \text{for all observations in the first group} \\ 1 & \text{for all observations in the second group} \end{cases}$$

$$D_2 = \begin{cases} 0 & \text{for all observations in the first group} \\ x_2 & \text{i.e., the respective value of } x \text{ for the second group} \end{cases}$$

The coefficient of $D_1$ measures the difference in the intercept terms and the
coefficient of $D_2$ measures the difference in the slope. Estimation of equation
(8.4) amounts to estimating the two equations separately if we assume that the
errors have an identical distribution. If we delete $D_2$ from equation (8.4), this
amounts to allowing for different intercepts but not different slopes, and if we
delete $D_1$, it amounts to allowing for different slopes but not different intercepts.
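That least squares applied to equation (8.4) reproduces the coefficients of the two separate regressions can be confirmed numerically. The following sketch assumes NumPy; the sample sizes, parameter values, and random seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y1 = 2.0 + 0.5 * x1 + rng.normal(0, 0.3, n)   # first group
y2 = 4.0 + 0.8 * x2 + rng.normal(0, 0.3, n)   # second group

def ols(X, y):
    """Least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Separate regressions for the two groups.
a1, b1 = ols(np.column_stack([np.ones(n), x1]), y1)
a2, b2 = ols(np.column_stack([np.ones(n), x2]), y2)

# Pooled regression (8.4): constant, D1, x, D2 with D2 = D1 * x.
x = np.concatenate([x1, x2])
y = np.concatenate([y1, y2])
D1 = np.concatenate([np.zeros(n), np.ones(n)])
D2 = D1 * x
pooled = ols(np.column_stack([np.ones(2 * n), D1, x, D2]), y)

# pooled holds [alpha1, alpha2 - alpha1, beta1, beta2 - beta1].
print(pooled)
print([a1, a2 - a1, b1, b2 - b1])
```

The point estimates coincide. The standard errors from the pooled regression rest on the additional assumption, stated above, that the errors in the two groups have an identical distribution.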

Suitable dummy variables can be defined when there are changes in slopes 
and intercepts at different times. Suppose that we have data for three periods 
and in the second period only the intercept changed (there was a parallel shift). 
In the third period the intercept and the slope have changed. Then we write 

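$$\begin{aligned}
y_1 &= \alpha_1 + \beta_1 x_1 + u_1 && \text{for period 1} \\
y_2 &= \alpha_2 + \beta_1 x_2 + u_2 && \text{for period 2} \\
y_3 &= \alpha_3 + \beta_2 x_3 + u_3 && \text{for period 3}
\end{aligned} \tag{8.5}$$

Then we can combine these equations and write the model as

$$y = \alpha_1 + (\alpha_2 - \alpha_1) D_1 + (\alpha_3 - \alpha_1) D_2 + \beta_1 x + (\beta_2 - \beta_1) D_3 + u \tag{8.6}$$

where

$$D_1 = \begin{cases} 1 & \text{for observations in period 2} \\ 0 & \text{for other periods} \end{cases}$$

$$D_2 = \begin{cases} 1 & \text{for observations in period 3} \\ 0 & \text{for other periods} \end{cases}$$

$$D_3 = \begin{cases} 0 & \text{for observations in periods 1 and 2} \\ x_3 & \text{or the respective value of } x \text{ for all observations in period 3} \end{cases}$$
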
Note that in all these examples we are assuming that the error terms in the 
different groups all have the same distribution. That is why we combine the 
data from the different groups and write an error term u as in (8.4) or (8.6) and 
estimate the equation by least squares. 

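An alternative way of writing the equations (8.5), which is very general, is
to stack the y-variables and the error terms in columns. Then write all the
parameters $\alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2$ down with their multiplicative factors stacked in
columns as follows:

$$\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}
= \alpha_1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
+ \alpha_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
+ \alpha_3 \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
+ \beta_1 \begin{pmatrix} x_1 \\ x_2 \\ 0 \end{pmatrix}
+ \beta_2 \begin{pmatrix} 0 \\ 0 \\ x_3 \end{pmatrix}
+ \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix} \tag{8.7}$$

What this says is

$$\begin{aligned}
y_1 &= \alpha_1(1) + \alpha_2(0) + \alpha_3(0) + \beta_1(x_1) + \beta_2(0) + u_1 \\
y_2 &= \alpha_1(0) + \alpha_2(1) + \alpha_3(0) + \beta_1(x_2) + \beta_2(0) + u_2 \\
y_3 &= \alpha_1(0) + \alpha_2(0) + \alpha_3(1) + \beta_1(0) + \beta_2(x_3) + u_3
\end{aligned}$$

where ( ) is used for multiplication, e.g., $\alpha_3(0) = \alpha_3 \times 0$.
Now we can write these equations as

$$y = \alpha_1 D_1 + \alpha_2 D_2 + \alpha_3 D_3 + \beta_1 D_4 + \beta_2 D_5 + u \tag{8.8}$$

where the definitions of $D_1, D_2, D_3, D_4, D_5$ are clear from (8.7). For instance,

$$D_2 = \begin{cases} 1 & \text{for observations in the second group} \\ 0 & \text{for all others} \end{cases}$$

$$D_4 = \begin{cases} \text{corresponding values of } x & \text{for observations in groups 1 and 2} \\ 0 & \text{for all observations in group 3} \end{cases}$$

Note that equation (8.8) has to be estimated without a constant term.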

In this method we define as many dummy variables as there are parameters
to estimate and we estimate the regression equation with no constant term. We
will give an illustrative example in the next section, where this method is
extended to take care of cross-equation constraints.

Note that equations (8.6) and (8.8) are equivalent. The method of writing the 
equation in terms of differences in the parameters is useful in tests for stability 
discussed in Section 8.5. 
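The equivalence of (8.6) and (8.8) can be illustrated on simulated data. In this sketch the three-period data, the parameter values, and the variable names are assumptions made for illustration; the point is only that the coefficients of the "differences" form map exactly into those of the no-constant form.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20                                   # hypothetical: 20 observations per period
period = np.repeat([1, 2, 3], n)
x = rng.uniform(0, 10, 3 * n)
alpha = np.array([1.0, 2.0, 4.0])        # assumed intercepts for periods 1, 2, 3
beta = np.where(period == 3, 1.5, 0.7)   # assumed slopes: beta_1 in periods 1-2, beta_2 in period 3
y = alpha[period - 1] + beta * x + rng.normal(0, 1, 3 * n)

p1 = (period == 1).astype(float)
p2 = (period == 2).astype(float)
p3 = (period == 3).astype(float)

# Form (8.6): constant, D1, D2, x, D3 (parameters expressed as differences).
X_diff = np.column_stack([np.ones(3 * n), p2, p3, x, p3 * x])
# Form (8.8): D1, ..., D5 with no constant term.
X_level = np.column_stack([p1, p2, p3, (1 - p3) * x, p3 * x])

c = np.linalg.lstsq(X_diff, y, rcond=None)[0]
d = np.linalg.lstsq(X_level, y, rcond=None)[0]

# Recover (alpha_1, alpha_2, alpha_3, beta_1, beta_2) from the differences form.
implied = np.array([c[0], c[0] + c[1], c[0] + c[2], c[3], c[3] + c[4]])
print(implied)
print(d)
```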


8.4 Dummy Variables for Cross-Equation Constraints


The method described in Section 8.3 can be extended to the case where some
parameters across equations are equal. As an illustration, consider the joint
estimation of the demand for beef, pork, and chicken on the basis of data
presented in Table 8.2.$^{4}$ Waugh estimates a set of demand equations of the form

$$\begin{aligned}
p_1 &= \alpha_1 + \beta_{11} x_1 + \beta_{12} x_2 + \beta_{13} x_3 + \gamma_1 y + u_1 \\
p_2 &= \alpha_2 + \beta_{12} x_1 + \beta_{22} x_2 + \beta_{23} x_3 + \gamma_2 y + u_2 \\
p_3 &= \alpha_3 + \beta_{13} x_1 + \beta_{23} x_2 + \beta_{33} x_3 + \gamma_3 y + u_3
\end{aligned} \tag{8.9}$$

where $p_1$ = retail price of beef
      $p_2$ = retail price of pork
      $p_3$ = retail price of chicken
      $x_1$ = consumption of beef per capita
      $x_2$ = consumption of pork per capita
      $x_3$ = consumption of chicken per capita
      $y$ = disposable income per capita

$x_1$, $x_2$, and $x_3$ are given in Table 8.2. The prices in Table 8.2 are, however, retail
prices divided by a consumer price index. Hence we multiplied them by the
consumer price index $p$ to get $p_1$, $p_2$, $p_3$. This index $p$ and disposable income $y$
are as follows:



Year      p       y      Year      p       y      Year      p       y
1948    0.838    1291    1953    0.932    1582    1958    1.007    1826
1949    0.830    1271    1954    0.936    1582    1959    1.015    1904
1950    0.838    1369    1955    0.934      --    1960    1.031    1934
1951    0.906    1473    1956       --    1742    1961    1.041    1980
1952    0.925    1520    1957       --      --    1962    1.054    2052


4 The data are from F. V. Waugh, Demand and Price Analysis: Some Examples from Agriculture,
U.S.D.A. Technical Bulletin 1316, November 1964, Table 5-1, p. 39.








Table 8.2 Per Capita Consumption and Deflated Prices of Selected Meats, 1948-1963

Columns, for each of beef, pork, lamb, veal, and chicken: consumption per capita and price.

Prices are divided by the consumer price index (1957-1959 = 100).

The 1963 data are preliminary and were not used in the analysis.

The special thing about the system of equations (8.9) is the symmetry in the
$\beta$ coefficients. We have

$$\frac{\partial p_1}{\partial x_2} = \frac{\partial p_2}{\partial x_1} = \beta_{12}, \qquad
\frac{\partial p_1}{\partial x_3} = \frac{\partial p_3}{\partial x_1} = \beta_{13}, \qquad
\frac{\partial p_2}{\partial x_3} = \frac{\partial p_3}{\partial x_2} = \beta_{23}$$

Thus there are cross-equation restrictions on the coefficients. If we assume that
$V(u_1) = V(u_2) = V(u_3)$, we can minimize $\sum u_1^2 + \sum u_2^2 + \sum u_3^2$, obtain the
normal equations, and estimate the regression coefficients. This is the method
used by Waugh. This method involves working out the necessary algebraic
expressions and programming things afresh. Instead, we can use the standard
regression programs by using the dummy variable method. We can write
equations (8.9) as a single equation:




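$$\begin{aligned}
\begin{pmatrix} p_1 \\ p_2 \\ p_3 \end{pmatrix}
&= \alpha_1 \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}
+ \alpha_2 \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}
+ \alpha_3 \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
+ \beta_{11} \begin{pmatrix} x_1 \\ 0 \\ 0 \end{pmatrix}
+ \beta_{12} \begin{pmatrix} x_2 \\ x_1 \\ 0 \end{pmatrix}
+ \beta_{13} \begin{pmatrix} x_3 \\ 0 \\ x_1 \end{pmatrix}
+ \beta_{22} \begin{pmatrix} 0 \\ x_2 \\ 0 \end{pmatrix} \\
&\quad + \beta_{23} \begin{pmatrix} 0 \\ x_3 \\ x_2 \end{pmatrix}
+ \beta_{33} \begin{pmatrix} 0 \\ 0 \\ x_3 \end{pmatrix}
+ \gamma_1 \begin{pmatrix} y \\ 0 \\ 0 \end{pmatrix}
+ \gamma_2 \begin{pmatrix} 0 \\ y \\ 0 \end{pmatrix}
+ \gamma_3 \begin{pmatrix} 0 \\ 0 \\ y \end{pmatrix}
+ \begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix}
\end{aligned}$$

We ran this equation with 45 observations and the 12 dummies (no constant
term). The values of the dummy variables are easily generated; for example,
the set of observations for $\beta_{12}$ consists of the 15 observations of $x_2$ followed by
the 15 observations of $x_1$ and 15 zeros. The results (with t-ratios in
parentheses) are$^{5}$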


          Constant      x_1        x_2        x_3         y
p_1        118.98     -1.534     -0.474     -0.445     +0.0650
          (12.00)    (14.55)     (4.31)     (3.01)     (12.61)
p_2        149.79     -0.474     -1.189     -0.319     +0.0162
           (9.18)     (4.31)     (6.20)     (1.54)      (2.83)
p_3        131.06     -0.445     -0.319     -2.389     +0.0199
           (7.36)     (3.01)     (1.54)     (4.32)      (1.66)


One can raise questions about the appropriateness of the specification of the
system of demand functions (8.9). Our purpose here has been merely to
illustrate the use of dummy variable methods to estimate equations where some
parameters in different equations are the same.
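The construction of the 12 stacked regressors for a system like (8.9) can be sketched as follows. The data here are synthetic (they are not Waugh's data), the parameter values in `true` are made up, and prices are generated without noise purely so that the recovery of the parameters can be checked; the helper `stack` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 15                                         # observations per equation, as in the text
x1, x2, x3, inc = rng.uniform(1, 5, (4, T))    # synthetic regressors, not Waugh's data

z = np.zeros(T)
one = np.ones(T)

def stack(a, b, c):
    """Stack three length-T blocks (equations 1, 2, 3) into one column of length 3T."""
    return np.concatenate([a, b, c])

# One stacked column per parameter; shared betas appear in two equations.
columns = {
    "alpha1": stack(one, z, z), "alpha2": stack(z, one, z), "alpha3": stack(z, z, one),
    "beta11": stack(x1, z, z),
    "beta12": stack(x2, x1, z),    # x2 in equation 1, x1 in equation 2
    "beta13": stack(x3, z, x1),    # x3 in equation 1, x1 in equation 3
    "beta22": stack(z, x2, z),
    "beta23": stack(z, x3, x2),    # x3 in equation 2, x2 in equation 3
    "beta33": stack(z, z, x3),
    "gamma1": stack(inc, z, z), "gamma2": stack(z, inc, z), "gamma3": stack(z, z, inc),
}
names = list(columns)
X = np.column_stack([columns[k] for k in names])   # 45 x 12, no constant column

# Assumed parameter values, used only to generate noise-free stacked prices.
true = dict(alpha1=100.0, alpha2=120.0, alpha3=90.0,
            beta11=-1.5, beta12=-0.5, beta13=-0.4,
            beta22=-1.2, beta23=-0.3, beta33=-2.0,
            gamma1=0.06, gamma2=0.02, gamma3=0.02)
theta = np.array([true[k] for k in names])
p = X @ theta                                      # stacked (p1, p2, p3)

est = np.linalg.lstsq(X, p, rcond=None)[0]         # regression with no constant term
for k, v in zip(names, est):
    print(f"{k}: {v:.4f}")
```

Because each shared coefficient has a single column, the symmetry restrictions hold automatically in the estimates.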


5 The results are almost the same as those obtained by Waugh. Part of the difference could be
that our program is in double precision.

8.5 Dummy Variables for Testing Stability of Regression Coefficients


Dummy variables can also be used to test for stability of regression coefficients
as discussed in Section 4.11. The definition of the appropriate dummy variables
depends on whether we are using the analysis of covariance test or the
predictive test for stability. We will first discuss the analysis-of-covariance test.

Consider, for instance, the two equations

$$y_1 = \alpha_1 + \beta_1 x_1 + \gamma_1 z_1 + u_1 \quad \text{for the first period}$$

$$y_2 = \alpha_2 + \beta_2 x_2 + \gamma_2 z_2 + u_2 \quad \text{for the second period}$$

We may be interested in testing hypotheses such as: none of the coefficients
changed between the two time periods; only the intercept changed; only the
intercept and the coefficient of the x-variable changed; and so on.
As discussed in Section 4.11, the analysis-of-variance test depends on
obtaining the unrestricted and restricted residual sums of squares. Both these residual
sums of squares can be obtained from the same dummy variable regression if
we define enough dummy variables.

For instance, we can write the equations for the two periods as 


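$$y = \alpha_1 + (\alpha_2 - \alpha_1) D_1 + \beta_1 x + (\beta_2 - \beta_1) D_2 + \gamma_1 z + (\gamma_2 - \gamma_1) D_3 + u \tag{8.10}$$

Note that we write the equation in terms of differences in the parameters and
define the dummy variables accordingly.

$$D_1 = \begin{cases} 1 & \text{for period 2} \\ 0 & \text{for period 1} \end{cases}$$

$$D_2 = \begin{cases} x_2 & \text{i.e., the corresponding value of } x \text{ for observations in period 2} \\ 0 & \text{for all observations in period 1} \end{cases}$$

$$D_3 = \begin{cases} z_2 & \text{i.e., the corresponding value of } z \text{ for all observations in period 2} \\ 0 & \text{for all observations in period 1} \end{cases}$$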


The unrestricted residual sum of squares is the one from estimating (8.10). As 
for the restricted residual sum of squares, it is obtained by deleting the dummy 
variables corresponding to that hypothesis. 



      Hypothesis                                                        Variables Deleted

(1)   All coefficients same                                             $D_1$, $D_2$, $D_3$
      $\alpha_1 = \alpha_2$, $\beta_1 = \beta_2$, $\gamma_1 = \gamma_2$

(2)   Only intercepts change                                            $D_2$, $D_3$
      $\beta_1 = \beta_2$, $\gamma_1 = \gamma_2$

(3)   Only intercepts and coefficients of z change                      $D_2$
      $\beta_1 = \beta_2$

There are some who argue in favor of estimating equations like (8.10) and
checking which of the dummy variables is significant in preference to the Chow
test discussed in Chapter 4.$^{6}$ However, we should be cautious in making
inferences about stability and instability of the coefficients by looking at the t-ratios
of the dummy variables alone. As we pointed out in our discussion of $R^2$
(Section 4.10) it is possible that the t-ratios for each of a set of coefficients are all
insignificant and still the F-ratio for the entire set of coefficients is significant.
What one should do in any particular example is to use the F-tests and then
use the t-tests on individual dummy variables only if they correspond to
economically meaningful hypotheses.
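The three restricted regressions and the corresponding F-ratios can be computed from one set of dummy variables. The sketch below uses simulated data in which only the intercept shifts; the data, the helper names `rss` and `f_stat`, and the labels H1 to H3 are assumptions made for illustration.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from least squares of y on the columns of X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    return float(e @ e)

def f_stat(rss_restricted, rss_unrestricted, q, df_unrestricted):
    """F = [(RRSS - URSS)/q] / [URSS/df] for q restrictions."""
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df_unrestricted)

rng = np.random.default_rng(3)
n1, n2 = 40, 40
n = n1 + n2
x, z = rng.normal(size=(2, n))
second = np.concatenate([np.zeros(n1), np.ones(n2)])   # 1 for period 2
# Hypothetical data-generating process: only the intercept changes.
y = 1.0 + 2.0 * second + 0.5 * x + 1.0 * z + rng.normal(size=n)

const = np.ones(n)
D1, D2, D3 = second, second * x, second * z
full = np.column_stack([const, D1, x, D2, z, D3])      # equation (8.10)
urss = rss(full, y)
df = n - full.shape[1]

restricted = {
    "H1": [const, x, z],           # (1) all coefficients same: drop D1, D2, D3
    "H2": [const, D1, x, z],       # (2) only intercepts change: drop D2, D3
    "H3": [const, D1, x, z, D3],   # (3) intercepts and z-coefficients change: drop D2
}
F = {}
for label, cols in restricted.items():
    Xr = np.column_stack(cols)
    q = full.shape[1] - Xr.shape[1]
    F[label] = f_stat(rss(Xr, y), urss, q, df)
    print(f"{label}: F({q}, {df}) = {F[label]:.2f}")
```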

As discussed in Section 4.11, the analysis-of-covariance test cannot be used
if $n_2 < k$. In this case the predictive test suggested by Chow is to use

$$F = \frac{(\mathrm{RSS} - \mathrm{RSS}_1)/n_2}{\mathrm{RSS}_1/(n_1 - k - 1)}$$

as an F-variate with degrees of freedom $n_2$ and $n_1 - k - 1$. Here RSS is the
residual sum of squares with $(n_1 + n_2)$ observations and $\mathrm{RSS}_1$ is the residual
sum of squares with $n_1$ observations.
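As a small arithmetic illustration of the predictive test statistic, the function below simply evaluates the ratio given above. The function name and the numbers passed to it are made up for the example.

```python
def chow_predictive_f(rss_all, rss_first, n1, n2, k):
    """Predictive F-ratio and its degrees of freedom (n2, n1 - k - 1)."""
    numerator = (rss_all - rss_first) / n2
    denominator = rss_first / (n1 - k - 1)
    return numerator / denominator, (n2, n1 - k - 1)

# Hypothetical values: RSS = 120 with all 36 observations, RSS_1 = 90 with the first 33.
F_value, dof = chow_predictive_f(rss_all=120.0, rss_first=90.0, n1=33, n2=3, k=2)
print(F_value, dof)
```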

The F-test, however, does not tell which of the $n_2$ observations contribute to
the instability of the coefficients or are outliers. To do this, we can define a set
of $n_2$ dummy variables$^{7}$

$$D_i = \begin{cases} 1 & \text{for observation } n_1 + i \\ 0 & \text{for other observations} \end{cases} \qquad i = 1, 2, \ldots, n_2$$

and test whether the coefficients of the dummy variables are zero. Since one
can get the standard error for the coefficient of each of these dummy variables
separately from the standard regression packages, one can easily check for outliers and see
which of the observations are significantly outside the regression line estimated
from the first $n_1$ observations.

The common regression parameters will be estimated from the first $n_1$
observations, the coefficient of the $i$th dummy variable ($i = 1, \ldots, n_2$) will
measure the prediction error for observation $n_1 + i$ based on the
coefficients estimated from the first $n_1$ observations, and the standard error of
this coefficient will measure the standard error of this prediction error.

Consider, for instance,

$$y = \begin{cases}
\alpha + \beta_1 x_1 + \beta_2 x_2 + u & \text{for the first } n_1 \text{ observations} \\
\alpha + \beta_1 x_1 + \beta_2 x_2 + \gamma_1 + u & \text{for the } (n_1 + 1)\text{th observation} \\
\alpha + \beta_1 x_1 + \beta_2 x_2 + \gamma_2 + u & \text{for the } (n_1 + 2)\text{th observation}
\end{cases}$$

6 D. Gujarati, “Use of Dummy Variables in Testing for Equality of Sets of Coefficients in Two
Linear Regressions: A Note,” American Statistician, February 1970. Gujarati argues that the
Chow test might reject the hypothesis of stability but not tell us which particular coefficients
are unstable, whereas the dummy variable method gives this information.

7 D. S. Salkever, “The Use of Dummy Variables to Compute Predictions, Prediction Errors and
Confidence Intervals,” Journal of Econometrics, Vol. 4, 1976, pp. 393-397.

Then minimizing the sum of squares

$$\sum_{i=1}^{n_1} u_i^2 + u_{n_1+1}^2 + u_{n_1+2}^2$$

we get $\hat{\alpha}$, $\hat{\beta}_1$, and $\hat{\beta}_2$ from the minimization of $\sum_{i=1}^{n_1} u_i^2$ and

$$\hat{\gamma}_{j-n_1} = y_j - \hat{\alpha} - \hat{\beta}_1 x_{1j} - \hat{\beta}_2 x_{2j} \quad \text{for } j = n_1 + 1 \text{ and } n_1 + 2$$
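This result can be checked on simulated data: with one dummy per observation in the second period, the common coefficients equal those from the first $n_1$ observations and the dummy coefficients equal the prediction errors. The data and variable names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2 = 25, 2
n = n1 + n2
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)   # hypothetical data

base = np.column_stack([np.ones(n), x1, x2])

# One dummy per observation in the second period: D_i = 1 for observation n1 + i.
D = np.zeros((n, n2))
D[n1 + np.arange(n2), np.arange(n2)] = 1.0

coef = np.linalg.lstsq(np.column_stack([base, D]), y, rcond=None)[0]
common, gamma = coef[:3], coef[3:]

# Direct route: fit on the first n1 observations, then predict the remaining n2.
b_first = np.linalg.lstsq(base[:n1], y[:n1], rcond=None)[0]
pred_error = y[n1:] - base[n1:] @ b_first

print("dummy coefficients:", gamma)
print("prediction errors: ", pred_error)
```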

What we have considered here for tests for stability is the use of dummy
variables to get within-sample predictions. We can use the dummy variable
method to generate out-of-sample predictions and their standard errors as well
(a problem we discussed in Section 4.7). Suppose we have $n$ observations on
$Y$, $x_1$ and $x_2$. We are given the $(n + 1)$th observation on $x_1$ and $x_2$ and asked to
get the predicted value of $Y$ and a standard error for the prediction. The way
we proceed is as follows.

Set the value of $Y$ for the $(n + 1)$th observation at zero. Define a dummy
variable $D$ as

$$D = \begin{cases} 0 & \text{for the first } n \text{ observations} \\ -1 & \text{for the } (n + 1)\text{th observation} \end{cases}$$

Now run a regression of $Y$ on $x_1$, $x_2$ and $D$ using the $(n + 1)$ observations.
The coefficient of $D$ is the prediction $\hat{Y}_{n+1}$ needed and its standard error is the
standard error of the prediction. To see this is so, note that the model says:

$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u \quad \text{for the } n \text{ observations}$$

$$0 = \alpha + \beta_1 x_1 + \beta_2 x_2 - \gamma + u \quad \text{for the } (n + 1)\text{th observation}$$

Minimizing the residual sum of squares amounts to obtaining the least squares
estimators $\hat{\alpha}$, $\hat{\beta}_1$, $\hat{\beta}_2$ from the first $n$ observations, and

$$\hat{\gamma} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 \quad \text{for the } (n + 1)\text{th observation}$$

Thus, $\hat{\gamma} = \hat{Y}_{n+1}$ and its standard error gives us the required standard error of
$\hat{Y}_{n+1}$.

This method has been extended by Pagan and Nicholls 8 to the case of non¬ 
linear models and simultaneous equation models. 
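A minimal numerical check of the out-of-sample device, on simulated data: the coefficient of $D$ and its standard error from the augmented regression are compared with the usual prediction and the usual standard error of the prediction error, $s\sqrt{1 + x_0'(X'X)^{-1}x_0}$. The data, the new observation `x_new`, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # constant, x1, x2
Y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)      # hypothetical data
x_new = np.array([1.0, 0.8, -1.2])                           # (n + 1)th observation on 1, x1, x2

# Augmented regression: Y is set to 0 and D = -1 for the extra observation.
Xa = np.vstack([X, x_new])
D = np.concatenate([np.zeros(n), [-1.0]])
Za = np.column_stack([Xa, D])
Ya = np.concatenate([Y, [0.0]])

coef = np.linalg.lstsq(Za, Ya, rcond=None)[0]
resid = Ya - Za @ coef
s2 = resid @ resid / (n + 1 - Za.shape[1])             # n - 3 degrees of freedom
se = np.sqrt(s2 * np.linalg.inv(Za.T @ Za)[-1, -1])    # standard error of the coefficient of D

# Direct formulas from the first n observations, for comparison.
b = np.linalg.lstsq(X, Y, rcond=None)[0]
e = Y - X @ b
s2_direct = e @ e / (n - 3)
pred = x_new @ b
se_direct = np.sqrt(s2_direct * (1 + x_new @ np.linalg.inv(X.T @ X) @ x_new))

print("prediction:", coef[-1], pred)
print("std. error:", se, se_direct)
```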


8.6 Dummy Variables Under Heteroskedasticity and Autocorrelation


In the preceding sections we discussed the use of dummy variables, but we need
to exercise some caution in the use of these variables when we have
heteroskedasticity or autocorrelation.

8 A. R. Pagan and D. F. Nicholls, “Estimating Predictions, Prediction Errors and Their Standard 
Deviations Using Constructed Variables,” Journal of Econometrics, Vol. 24, 1984, pp. 293- 
310. 




Consider first the case of heteroskedasticity. Suppose that we have the two
equations

$$y = \begin{cases} \alpha_1 + \beta_1 x + u_1 & \text{for the first group} \\ \alpha_2 + \beta_2 x + u_2 & \text{for the second group} \end{cases}$$

Let $\mathrm{var}(u_1) = \sigma_1^2$ and $\mathrm{var}(u_2) = \sigma_2^2$. When we pool the data and estimate an
equation like (8.4), we are implicitly assuming that $\sigma_1^2 = \sigma_2^2$. If $\sigma_1^2$ and $\sigma_2^2$ are
widely different, then, even if $\alpha_2$ is not significantly different from $\alpha_1$ and $\beta_2$ is
not significantly different from $\beta_1$, the coefficients of the dummy variables in
(8.4) can turn out to be significant. One can easily demonstrate this by
generating data for the two groups imposing $\alpha_1 = \alpha_2$ and $\beta_1 = \beta_2$ but $\sigma_1^2 = 16\sigma_2^2$ ($\sigma_1$
being chosen suitably), and estimating equation (8.4). The reverse situation can
also arise; that is, ignoring heteroskedasticity can make significant differences
appear to be insignificant. Suppose that we take $\alpha_1 = 2\alpha_2$ and $\beta_1 = 2\beta_2$. Then
by taking $\sigma_1^2 = 16\sigma_2^2$ (or a multiple around that) we can make the dummy
variables appear nonsignificant. The problem is just the same as that of applying
tests for stability under heteroskedasticity.

Regarding autocorrelation, suppose that the errors in the equations for the
two groups are first-order autoregressive so that we use a first-order
autoregressive transformation, that is, write

$$y_t^* = y_t - \rho y_{t-1} \quad \text{for } y_t$$

$$x_t^* = x_t - \rho x_{t-1} \quad \text{for } x_t$$

The question is: What happens to the dummy variables in equation (8.4)?
These variables should not be subject to the autoregressive transformation and
care should be taken if the computer program we use does this automatically.$^{9}$
We can easily derive the appropriate dummy variables in this case.

Consider the case with $n_1$ observations in the first group and $n_2$ observations
in the second group. We will introduce the time subscript $t$ for each observation
later when needed.

Define

$$\alpha_1^* = \alpha_1 (1 - \rho)$$

$$\alpha_2^* = \alpha_2 (1 - \rho)$$

Then equation (8.4) can be rewritten as

$$y^* = \alpha_1^* + (\alpha_2^* - \alpha_1^*) D_1 + \beta_1 x^* + (\beta_2 - \beta_1) D_2 + e$$

where $D_2$ will be defined as before with $x_2^*$ in place of $x_2$ and the random errors
$e_t$ are defined by

$$u_t - \rho u_{t-1} = e_t$$

9 This point was brought to my attention by Ben-Zion Zilberfarb through a manuscript “On the
Use of Autoregressive Transformation in Equations Containing Dummy Variables.”

This equation, however, is all right only for the observations in the first group and
the last $(n_2 - 1)$ observations in the second group. The problem is
with the first observation in the second group. For this observation, the
$\rho$-differenced equation turns out to be

$$y_t - \rho y_{t-1} = \alpha_2 - \rho \alpha_1 + (\beta_2 x_t - \rho \beta_1 x_{t-1}) + e_t$$

or

$$y_t^* = \alpha_1^* + \frac{1}{1 - \rho} (\alpha_2^* - \alpha_1^*) + \beta_1 x_t^* + (\beta_2 - \beta_1) x_t + e_t$$

This means that the dummy variables $D_1$ and $D_2$ have to be defined as follows:

$$D_1 = \begin{cases}
0 & \text{for all observations in the first group} \\
\dfrac{1}{1 - \rho} & \text{for the first observation in the second group} \\
1 & \text{for all the other observations in the second group}
\end{cases}$$

$$D_2 = \begin{cases}
0 & \text{for all observations in the first group} \\
x_t & \text{for the first observation in the second group} \\
x_t^* & \text{for all the other observations in the second group}
\end{cases}$$
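The definitions of $D_1$ and $D_2$ under the autoregressive transformation can be sketched in code and verified on noise-free data. The function name `ar1_dummies`, the sample values, and the parameter values are hypothetical; $\rho$ is treated as known.

```python
def ar1_dummies(x, n1, rho):
    """Return (D1, D2) for the rho-differenced form of (8.4).

    x holds all n1 + n2 observations in time order. The first observation
    is lost in differencing, so the output has len(x) - 1 entries.
    """
    d1, d2 = [], []
    for t in range(1, len(x)):
        if t < n1:                       # first group
            d1.append(0.0)
            d2.append(0.0)
        elif t == n1:                    # first observation in the second group
            d1.append(1.0 / (1.0 - rho))
            d2.append(x[t])
        else:                            # other observations in the second group
            d1.append(1.0)
            d2.append(x[t] - rho * x[t - 1])
    return d1, d2

# Noise-free check with made-up values.
x = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
n1, rho = 3, 0.5
a1, b1, a2, b2 = 1.0, 0.5, 3.0, 0.9
y = [(a1 + b1 * v) if t < n1 else (a2 + b2 * v) for t, v in enumerate(x)]

d1, d2 = ar1_dummies(x, n1, rho)
gaps = []
for i, t in enumerate(range(1, len(x))):
    y_star = y[t] - rho * y[t - 1]
    x_star = x[t] - rho * x[t - 1]
    rhs = a1 * (1 - rho) + (a2 - a1) * (1 - rho) * d1[i] + b1 * x_star + (b2 - b1) * d2[i]
    gaps.append(abs(y_star - rhs))
print(d1, max(gaps))
```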


8.7 Dummy Dependent Variables 


Until now we have been considering models where the explanatory variables
are dummy variables. We now discuss models where the explained variable is
a dummy variable. This dummy variable can take on two or more values but
we consider here the case where it takes on only two values, zero or 1.
Considering the other cases is beyond the scope of this book.$^{10}$ Since the dummy
variable takes on two values, it is called a dichotomous variable. There are
numerous examples of dichotomous explained variables. For instance,

$$y = \begin{cases} 1 & \text{if a person is in the labor force} \\ 0 & \text{otherwise} \end{cases}$$

or

$$y = \begin{cases} 1 & \text{if a person owns a home} \\ 0 & \text{otherwise} \end{cases}$$

There are several methods to analyze regression models where the dependent 
variable is a zero or 1 variable. The simplest procedure is to just use the usual 


10 This is a complete area by itself, so we give only an elementary introduction. A more detailed
discussion of this topic can be found in G. S. Maddala, Limited-Dependent and Qualitative
Variables in Econometrics (Cambridge: Cambridge University Press, 1983), Chap. 2, “Discrete
Regression Models”; T. Amemiya, “Qualitative Response Models: A Survey,” Journal of
Economic Literature, December 1981, pp. 1483-1536; and D. R. Cox, Analysis of Binary Data
(London: Methuen, 1970).

least squares method. In this case the model is called the linear probability
model. Another method, called the “linear discriminant function,” is related to
the linear probability model. The other alternative is to say that there is an
underlying or latent variable $y^*$ which we do not observe. What we observe is

$$y = \begin{cases} 1 & \text{if } y^* > 0 \\ 0 & \text{otherwise} \end{cases}$$

This is the idea behind the logit and probit models. First we discuss these
methods and then give an illustrative example.


8.8 The Linear Probability Model and the Linear Discriminant Function


The Linear Probability Model 


The term linear probability model is used to denote a regression model in which 
the dependent variable y is a dichotomous variable taking the value 1 or zero. 
For the sake of simplicity we consider only one explanatory variable, x. 

The variable $y$ is an indicator variable that denotes the occurrence or
nonoccurrence of an event. For instance, in an analysis of the determinants of
unemployment, we have data on each person that shows whether or not the
person is employed, and we have some explanatory variables that determine
the state of employment. Here the event under consideration is employment.
We define the dichotomous variable

$$y = \begin{cases} 1 & \text{if the person is employed} \\ 0 & \text{otherwise} \end{cases}$$

Similarly, in an analysis of bankruptcy of firms, we define

$$y = \begin{cases} 1 & \text{if the firm is bankrupt} \\ 0 & \text{otherwise} \end{cases}$$

We write the model in the usual regression framework as

$$y_i = \beta x_i + u_i \tag{8.11}$$

with $E(u_i) = 0$. The conditional expectation $E(y_i \mid x_i)$ is equal to $\beta x_i$. This has to
be interpreted in this case as the probability that the event will occur given the
$x_i$. The calculated value of $y$ from the regression equation (i.e., $\hat{y}_i = \hat{\beta} x_i$) will
then give the estimated probability that the event will occur given the particular
value of $x$. In practice these estimated probabilities can lie outside the
admissible range (0, 1).

Since $y_i$ takes the value 1 or zero, the errors in equation (8.11) can take only
two values, $(1 - \beta x_i)$ and $(-\beta x_i)$. Also, with the interpretation we have given
equation (8.11), and the requirement that $E(u_i) = 0$, the respective probabilities
of these events are $\beta x_i$ and $(1 - \beta x_i)$. Thus we have

      $u_i$               $f(u_i)$
      $1 - \beta x_i$     $\beta x_i$
      $-\beta x_i$        $(1 - \beta x_i)$

Hence

$$\begin{aligned}
\mathrm{var}(u_i) &= \beta x_i (1 - \beta x_i)^2 + (1 - \beta x_i)(-\beta x_i)^2 \\
&= \beta x_i (1 - \beta x_i) \\
&= E(y_i)[1 - E(y_i)]
\end{aligned}$$

Because of this heteroskedasticity problem the ordinary least squares (OLS)
estimates of $\beta$ from equation (8.11) will not be efficient. We can use the
following two-step procedure.$^{11}$

First estimate (8.11) by least squares.

Next compute $\hat{y}_i (1 - \hat{y}_i)$ and use weighted least squares, that is, defining

$$w_i = \sqrt{\hat{y}_i (1 - \hat{y}_i)}$$

we regress $y_i / w_i$ on $x_i / w_i$. The problems with this procedure are:

1. $\hat{y}_i (1 - \hat{y}_i)$ in practice may be negative, although in large samples this will
be so with a very small probability$^{12}$ since $\hat{y}_i (1 - \hat{y}_i)$ is a consistent
estimator for $E(y_i)[1 - E(y_i)]$.

2. Since the errors $u_i$ are obviously not normally distributed, there is a
problem with the application of the usual tests of significance. As we will see
in the next section, on the linear discriminant function, they can be
justified only under the assumption that the explanatory variables have a
multivariate normal distribution.

3. The most important criticism is with the formulation itself: that the
conditional expectation $E(y_i \mid x_i)$ be interpreted as the probability that the event
will occur. In many cases $E(y_i \mid x_i)$ can lie outside the limits (0, 1).
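The two-step procedure can be sketched as below. Note one departure from the text, made only so that the weights are always defined: fitted values are clipped into $(\epsilon, 1 - \epsilon)$ before computing $w_i$, which is an assumption of this sketch and one of several ad hoc ways of handling problem 1. The simulated data and the function name `lpm_two_step` are also illustrative.

```python
import numpy as np

def lpm_two_step(X, y, eps=1e-3):
    """OLS followed by weighted least squares for the linear probability model."""
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ b_ols
    # Assumption of this sketch: clip fitted values so that y_hat(1 - y_hat) > 0.
    p = np.clip(y_hat, eps, 1 - eps)
    w = np.sqrt(p * (1 - p))
    # Regress y / w on X / w (every column, including the constant, is divided by w).
    b_wls = np.linalg.lstsq(X / w[:, None], y / w, rcond=None)[0]
    return b_ols, b_wls, y_hat

rng = np.random.default_rng(6)
n = 2000
x = rng.uniform(0, 1, n)
y = (rng.uniform(size=n) < 0.2 + 0.6 * x).astype(float)   # hypothetical: P(y = 1) = 0.2 + 0.6 x
X = np.column_stack([np.ones(n), x])

b_ols, b_wls, y_hat = lpm_two_step(X, y)
print("OLS:", b_ols)
print("WLS:", b_wls)
```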

The limitations of the linear probability model are shown in Figure 8.3, which
shows the bunching up of points$^{13}$ along $y = 0$ and $y = 1$. The predicted values
can easily lie outside the interval (0, 1) and the prediction errors can be very
large.

In the 1960s and early 1970s the linear probability model was widely used
mainly because it is a model that can be easily estimated using multiple
regression analysis. Some others used discriminant analysis, not noting that this
method is very similar to the linear probability model. For instance, Meyer and

11 A. S. Goldberger, Econometric Theory (New York: Wiley, 1964), p. 250.

12 R. G. McGilvray, “Estimating the Linear Probability Function,” Econometrica, 1970, pp.
775-776.

13 This bunching of points is described in M. Nerlove and S. J. Press, “Univariate and
Multivariate Log-Linear and Logistic Models,” Report R-1306-EDA/NIH, Rand Corporation, Santa
Monica, Calif., December 1973.

Figure 8.3. Predictions from the linear probability model.


Pifer$^{14}$ analyzed bank failures using the linear probability model and Altman$^{15}$
analyzed bankruptcy of manufacturing corporations using discriminant
analysis. In both studies the bankrupt banks or corporations were taken and a paired
sample of nonbankrupt banks or corporations was chosen (i.e., for each
bankrupt bank or corporation, a similarly situated nonbankrupt bank or corporation
was found). Then the linear probability model or linear discriminant function
was estimated.

Since the linear probability model and linear discriminant function are 
closely related, we will discuss the latter here. 


The Linear Discriminant Function 

Suppose that we have $n$ individuals for whom we have observations on $k$
explanatory variables and we observe that $n_1$ of them belong to a group $\pi_1$ and $n_2$
of them belong to a second group $\pi_2$, where $n_1 + n_2 = n$. We want to construct
a linear function of the $k$ variables that we can use to predict that a new
observation belongs to one of the two groups. This linear function is called the linear
discriminant function.

As an example suppose that we have data on a number of loan applicants
and we observe that $n_1$ of them were granted loans and $n_2$ of them were denied
loans. We also have the socioeconomic characteristics on the applicants. Now

14 Paul A. Meyer and Howard W. Pifer, “Prediction of Bank Failures,” Journal of Finance, 1970,
pp. 853-868.

15 Edward I. Altman, “Financial Ratios, Discriminant Analysis and the Prediction of Corporate
Bankruptcy,” Journal of Finance, September 1968.

given a new applicant with specified socioeconomic characteristics, we would
want to predict whether the applicant will get a loan or not.

Let us define a linear function

$$Z = \lambda_0 + \sum_{i=1}^{k} \lambda_i x_i$$

Then it is intuitively clear that to get the best discrimination between the two
groups, we would want to choose the $\lambda_i$ so that the ratio

$$\frac{\text{between-group variance of } Z}{\text{within-group variance of } Z}$$

is a maximum.


Fisher$^{16}$ suggested an analogy between this problem and multiple regression
analysis. He suggested that we define a dummy variable

$$y = \begin{cases}
\dfrac{n_2}{n_1 + n_2} & \text{if the individual belongs to } \pi_1 \text{ (first group)} \\[2ex]
-\dfrac{n_1}{n_1 + n_2} & \text{if the individual belongs to } \pi_2 \text{ (second group)}
\end{cases}$$

Now estimate the multiple regression equation

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u$$

Get the residual sum of squares RSS. Then

$$\hat{\lambda}_i = \frac{\hat{\beta}_i}{\mathrm{RSS}/(n_1 + n_2 - 2)}$$

Thus, once we have the regression coefficients and the residual sum of squares
from the dummy dependent variable regression, we can very easily obtain the
discriminant function coefficients.$^{17}$

The linear probability model is only slightly different from the formulation of
Fisher. In the linear probability model we define

$$y = \begin{cases} 1 & \text{if the individual belongs to } \pi_1 \\ 0 & \text{if the individual belongs to } \pi_2 \end{cases}$$

This merely amounts to adding $n_1/(n_1 + n_2)$ to each observation of $y$ as
defined by Fisher. Thus only the estimate of the constant term changes.
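Both statements can be checked numerically on simulated data. The sketch below assumes the standard definition of the discriminant coefficients, $\lambda = S^{-1}(\bar{x}_1 - \bar{x}_2)$ with $S$ the pooled within-group covariance matrix, which is not spelled out in the text; the data and variable names are also made up.

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2, k = 30, 20, 2
X1 = rng.normal(loc=[1.0, 0.0], size=(n1, k))   # group pi_1 (hypothetical data)
X2 = rng.normal(loc=[0.0, 0.5], size=(n2, k))   # group pi_2
X = np.vstack([X1, X2])
Z = np.column_stack([np.ones(n1 + n2), X])

# Fisher's dummy dependent variable and the 0/1 variable of the linear probability model.
y_fisher = np.concatenate([np.full(n1, n2 / (n1 + n2)), np.full(n2, -n1 / (n1 + n2))])
y_lpm = np.concatenate([np.ones(n1), np.zeros(n2)])

b_f = np.linalg.lstsq(Z, y_fisher, rcond=None)[0]
b_l = np.linalg.lstsq(Z, y_lpm, rcond=None)[0]

# Discriminant coefficients from the regression: beta_i / [RSS / (n1 + n2 - 2)].
resid = y_fisher - Z @ b_f
rss = resid @ resid
lam_from_regression = b_f[1:] / (rss / (n1 + n2 - 2))

# Direct computation (assumed standard definition): S^{-1} (xbar_1 - xbar_2).
d = X1.mean(axis=0) - X2.mean(axis=0)
C1 = X1 - X1.mean(axis=0)
C2 = X2 - X2.mean(axis=0)
S = (C1.T @ C1 + C2.T @ C2) / (n1 + n2 - 2)
lam_direct = np.linalg.solve(S, d)

print("slopes, Fisher vs LPM:", b_f[1:], b_l[1:])
print("constant shift:", b_l[0] - b_f[0], n1 / (n1 + n2))
print("lambda:", lam_from_regression, lam_direct)
```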


16 R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of
Eugenics, 1936, pp. 179-188.

17 We are omitting the algebraic details here. They can be found in Maddala, Limited-Dependent,
pp. 18-21. Also, the tests of significance for the coefficients of the linear discriminant function
or the linear probability model discussed later are described there.





8.9 The Probit and Logit Models 


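An alternative approach is to assume that we have a regression model

$$y_i^* = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + u_i \tag{8.12}$$

where $y_i^*$ is not observed. It is commonly called a “latent” variable. What we
observe is a dummy variable $y_i$ defined by

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{otherwise} \end{cases} \tag{8.13}$$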


The probit and logit models differ in the specification of the distribution of the
error term $u$ in (8.12). The difference between the specification (8.12) and the
linear probability model is that in the linear probability model we analyze the
dichotomous variables as they are, whereas in (8.12) we assume the existence
of an underlying latent variable for which we observe a dichotomous
realization. For instance, if the observed dummy variable is whether or not the person
is employed, $y_i^*$ would be defined as “propensity or ability to find employment.”
Similarly, if the observed dummy variable is whether or not the person has
bought a car, then $y_i^*$ would be defined as “desire or ability to buy a car.” Note
that in both the examples we have given, there is “desire” and “ability”
involved. Thus the explanatory variables in (8.12) would contain variables that
explain both these elements.

Note from (8.13) that multiplying $y_i^*$ by any positive constant does not change
$y_i$. Hence if we observe $y_i$, we can estimate the $\beta$'s in (8.12) only up to a positive
multiple. Hence it is customary to assume $\mathrm{var}(u_i) = 1$. This fixes the scale of
$y_i^*$. From the relationships (8.12) and (8.13) we get

$$\begin{aligned}
P_i = \mathrm{Prob}(y_i = 1) &= \mathrm{Prob}\left[ u_i > -\left( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \right) \right] \\
&= 1 - F\left[ -\left( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \right) \right]
\end{aligned}$$

where $F$ is the cumulative distribution function of $u$.

If the distribution of $u$ is symmetric, since $1 - F(-Z) = F(Z)$, we can write

$$P_i = F\left( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \right) \tag{8.14}$$


Since the observed $y_i$ are just realizations of a binomial process with
probabilities given by (8.14) and varying from trial to trial (depending on $x_{ij}$), we can
write the likelihood function as

$$L = \prod_{y_i = 1} P_i \prod_{y_i = 0} (1 - P_i) \tag{8.15}$$


The functional form for $F$ in (8.14) will depend on the assumption made about
the error term $u$. If the cumulative distribution of $u_i$ is logistic we have what is
known as the logit model. In this case

$$F(Z_i) = \frac{\exp(Z_i)}{1 + \exp(Z_i)} \tag{8.16}$$

Hence

$$\log \frac{F(Z_i)}{1 - F(Z_i)} = Z_i$$

Note that for the logit model

$$\log \frac{P_i}{1 - P_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}$$

The left-hand side of this equation is called the log-odds ratio. Thus the
log-odds ratio is a linear function of the explanatory variables. For the linear
probability model it is $P_i$ that is assumed to be a linear function of the explanatory
variables.
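The relation between the logistic distribution function (8.16) and the log-odds ratio can be illustrated in a few lines. The function names are hypothetical and the values of $Z$ are arbitrary.

```python
import math

def logistic_cdf(z):
    """F(Z) = exp(Z) / (1 + exp(Z)), written in an algebraically equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Log-odds ratio log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

# The log-odds of F(Z) returns Z itself, so the log-odds ratio is linear in the regressors.
for z in (-2.0, -0.5, 0.0, 1.0, 3.0):
    print(z, logistic_cdf(z), log_odds(logistic_cdf(z)))
```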

If the errors $u_i$ in (8.12) follow a normal distribution, we have the probit
model (it should more appropriately be called the normit model, but the word
probit was used in the biometrics literature). In this case

$$F(Z_i) = \int_{-\infty}^{Z_i} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right) dt \tag{8.17}$$

Maximization of the likelihood function (8.15) for either the probit or the logit
model is accomplished by nonlinear estimation methods. There are now several
computer programs available for probit and logit analysis, and these programs
are very inexpensive to run.

The log of the likelihood function (8.15) is concave$^{18}$ (does not have multiple maxima),
and hence any starting values of the parameters would do. It is customary to
start the iterations for the logit and probit models with the estimates from the
linear probability model.
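One common nonlinear method is Newton-Raphson applied to the log of (8.15). The following is a minimal sketch for the logit model on simulated data; it starts from zeros rather than from linear probability estimates (a choice made here for brevity), and the function name, tolerances, and data are assumptions of the example, not a description of any particular program.

```python
import numpy as np

def logit_mle(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson maximization of the logit log-likelihood."""
    beta = np.zeros(X.shape[1])                        # starting values (zeros in this sketch)
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # P_i from (8.14) with the logistic F
        grad = X.T @ (y - p)                           # score vector
        hess = -(X * (p * (1 - p))[:, None]).T @ X     # Hessian (negative definite)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                             # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(8)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))       # hypothetical true parameters
y = (rng.uniform(size=n) < p_true).astype(float)

beta_hat = logit_mle(X, y)
score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_hat)))
print("estimates:", beta_hat)
print("score at the estimates:", score)
```

At the maximum the score vector is zero, which is what the last line checks.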

Since the cumulative normal and the logistic distributions are very close to
each other except at the tails, we are not likely to get very different results
using (8.16) or (8.17), that is, the logit or the probit method, unless the samples
are large (so that we have enough observations at the tails). However, the
estimates of the parameters $\beta_j$ from the two methods are not directly comparable.
Since the logistic distribution has a variance $\pi^2/3$, the estimates of $\beta_j$ obtained
from the logit model have to be multiplied by $\sqrt{3}/\pi$ to be comparable to the


18 This is proved for a general model in J. W. Pratt, “Concavity of the Log-Likelihood,” Journal
of the American Statistical Association, 1981, pp. 137-159.





estimates obtained from the probit model (where we normalize $\sigma$ to be equal
to 1).

Amemiya$^{19}$ suggests that the logit estimates be multiplied by 1/1.6 = 0.625
instead of $\sqrt{3}/\pi$, saying that this transformation produces a closer
approximation between the logistic distribution and the distribution function of the
standard normal. He also suggests that the coefficients of the linear probability
model $\beta_{LP}$ and the coefficients of the logit model $\beta_L$ are related by the relations:

$$\beta_{LP} \simeq 0.25 \beta_L \quad \text{except for the constant term}$$

$$\beta_{LP} \simeq 0.25 \beta_L + 0.5 \quad \text{for the constant term}$$

Thus if we need to make $\beta_{LP}$ comparable to the probit coefficients, we need to
multiply them by 2.5 and subtract 1.25 from the constant term.
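These rules of thumb are simple rescalings, and the arithmetic can be written out as follows. The function names and the input values are hypothetical; the factors 0.625, 2.5, and 1.25 are the ones quoted above.

```python
import math

def logit_to_probit_scale(beta_logit):
    """Multiply logit coefficients by 0.625 (Amemiya's factor 1/1.6)."""
    return [0.625 * b for b in beta_logit]

def lpm_to_probit_scale(slopes, constant):
    """Multiply linear probability coefficients by 2.5; subtract 1.25 from the constant."""
    return [2.5 * b for b in slopes], 2.5 * constant - 1.25

# The variance-based factor sqrt(3)/pi, for comparison with 0.625.
variance_factor = math.sqrt(3) / math.pi

# Hypothetical coefficients, for illustration only.
probit_like_slopes, probit_like_constant = lpm_to_probit_scale([0.1, -0.2], 0.7)
print(variance_factor)
print(logit_to_probit_scale([1.6, -0.8]))
print(probit_like_slopes, probit_like_constant)
```

The factor 2.5 is 4 times 0.625: the linear probability slopes are first multiplied by 4 to put them on the logit scale and then by 0.625 to put them on the probit scale.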

Alternative ways of comparing the models would be: 

1. To calculate the sum of squared deviations from predicted probabilities. 

2. To compare the percentages correctly predicted. 

3. To look at the derivatives of the probabilities with respect to a particular 
independent variable. 


Illustrative Example 


As an illustration, we consider data on a sample of 750 mortgage applications
in the Columbia, South Carolina, metropolitan area.$^{20}$ There were 500 loan
applications accepted and 250 loan applications rejected. We define

$$y = \begin{cases} 1 & \text{if the loan application was accepted} \\ 0 & \text{if the loan application was rejected} \end{cases}$$

Three models were estimated: the linear probability model, the logit model, and
the probit model. The explanatory variables were:

AI   = applicant’s and coapplicant’s income ($10^3$ dollars)
XMD  = debt minus mortgage payment ($10^3$ dollars)
DF   = dummy variable, 1 for female, 0 for male
DR   = dummy variable, 1 for nonwhite, 0 for white
DS   = dummy variable, 1 for single, 0 otherwise
DA   = age of house ($10^2$ years)
NNWP = percent nonwhite in the neighborhood ($\times 10^3$)
NMFI = neighborhood mean family income ($10^5$ dollars)
NA   = neighborhood average age of homes ($10^2$ years)


19 T. Amemiya, “Qualitative Response Models: A Survey,” Journal of Economic Literature,
1981, p. 1488.

20 This example is from G. S. Maddala and R. P. Trost, “On Measuring Discrimination in Loan
Markets,” Housing Finance Review, 1982, pp. 245-268.

Table 8.3 Comparison of the Probit, Logit, and Linear Probability Models: Loan
Data from South Carolina$^{a}$

Variable    Linear Probability Model    Logit Model       Probit Model
AI           1.489 (4.69)                2.254 (4.60)      2.030 (4.73)
XMD         -1.509 (5.74)               -1.170 (5.57)     -1.773 (5.67)
DF           0.140 (0.78)                0.563 (0.87)      0.206 (0.95)
DR          -0.266 (1.84)               -0.240 (1.60)     -0.279 (1.66)
DS          -0.238 (1.75)               -0.222 (1.51)     -0.274 (1.70)
DA          -1.426 (3.52)               -1.463 (3.34)     -1.570 (3.29)
NNWP        -1.762 (0.74)               -2.028 (0.80)     -2.360 (0.85)
NMFI         0.150 (0.23)                0.149 (0.20)      0.194 (0.25)
NA          -0.393 (1.34)               -0.386 (1.25)     -0.425 (1.26)
Constant     0.501                       0.363             0.488

a Figures in parentheses are t-ratios, not standard errors.


The results are presented in Table 8.3. 

The coefficients of the probit model were left as they were computed. The 
other coefficients were adjusted as follows: 21 

1. The coefficients of the logit model were multiplied by 0.625. 

2. The coefficients of the linear probability model were multiplied by 2.5 
throughout and then 1.25 was subtracted from the constant term. 

These are the adjustments described in the text. The three sets of coefficients 
reported in Table 8.3 are not much different from each other (particularly those 
of the logit and the probit models). 

One can compare the three models by comparing the $R^2$'s. We will illustrate
this with another example in the next section after defining the different
measures of $R^2$ for the qualitative dependent variable models.


The Problem of Disproportionate Sampling 

In many applications of the logit, probit, or linear probability models it happens
that the number of observations in one of the groups is much smaller than the
number in the other group. For instance, in an analysis of bank failures, the
number of failed banks would be much smaller than the number of solvent
banks. In a study of unemployment, the number of unemployed persons is
much smaller than the number of employed persons. Thus either we have to
get a very large data set (which is what is done in the studies of unemployment
based on census tapes) or we have to sample the two groups at different
sampling rates. For instance, in an analysis of bank failures, all the failed
banks are considered in the analysis, but only a small percentage of the solvent
banks are sampled. In the example of loan applications in Columbia, South
Carolina (for which we presented estimates in Table 8.3), the sampling was
actually at different rates. There were 4600 applications in the accepted
category and 250 applications in the rejected category. To have enough
observations on females and blacks in the rejected category, it was decided to
include all the 250 observations from the rejected category and get a random
sample of 500 observations from the accepted category. Thus the sampling rate
was 100% for the rejected group and 10.87% (500/4600) for the accepted group.

21 After getting the estimates from the three models, it is always desirable to
adjust the coefficients so that they are all on a comparable level. The linear
probability model can be estimated by any multiple regression program. As for
the logit and probit models, there are many computer programs available now
(TSP, RATS, LIMDEP, etc.).

In such cases a question arises as to how one should analyze the data. It has
been commonly suggested that one should use a weighted logit (or probit or
linear probability) model similar to the weighted least squares method we
discussed under heteroscedasticity in Chapter 5. For instance, Avery and
Hanweck argue: 22 “In addition, because failed and non-failed banks were sampled
at different rates, it was also necessary to weight observations in estimation”
(p. 387). However, this is not a correct procedure. The usual logit model can
be used without any change even with unequal sampling rates. Thus the results
presented in Table 8.3 are based on the usual estimation procedures with no
weighting used.

Regarding the estimation of the coefficients of the explanatory variables, if
we use the logit model, the coefficients are not affected by the unequal sampling
rates for the two groups. It is only the constant term that is affected. 23 In Table
8.3 the logit coefficients are all correct, except for the constant term, which
needs to be decreased by $\log p_1 - \log p_2$, where $p_1$ and $p_2$ are the
proportions of observations chosen from the two groups for which $y = 1$ and
$y = 0$, respectively, and the logarithm is the natural logarithm. In the example
in Table 8.3, the constant term for the logit model has to be decreased by
$\log(0.1087) - \log(1.0) = -2.22$. Since the coefficients in Table 8.3 have been
adjusted (as described earlier), we have to multiply this by 0.625. Thus the
decrease in the constant term is $-1.39$, that is, an increase of 1.39.
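The constant-term correction just described is easy to check numerically. The following is a minimal sketch, assuming the sampling proportions quoted in the text (500 of 4600 accepted applications, all 250 rejected applications); the function name and its `scale` argument are illustrative choices, not part of the original study.

```python
import math


def constant_decrease(p1, p2, scale=1.0):
    """Amount by which the logit constant must be decreased:
    scale * (log p1 - log p2), using natural logarithms.

    p1, p2 : sampling proportions for the y = 1 and y = 0 groups.
    scale  : extra factor if the coefficients were rescaled (0.625 here).
    """
    return scale * (math.log(p1) - math.log(p2))


p_accepted = 500 / 4600   # y = 1 group, about 0.1087
p_rejected = 250 / 250    # y = 0 group, sampled in full

raw = constant_decrease(p_accepted, p_rejected)              # about -2.22
scaled = constant_decrease(p_accepted, p_rejected, 0.625)    # about -1.39

# Subtracting a negative number raises the constant reported in Table 8.3.
corrected_constant = 0.363 - scaled
print(round(raw, 2), round(scaled, 2), round(corrected_constant, 2))
```

Only the constant changes; the slope coefficients of the logit model are left as estimated.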

Note that this result is valid for the logit model, not for the probit model or 
the linear probability model. However, even for these models, although one 
cannot derive the results analytically, it appears that the slope coefficients are 
not much affected by unequal sampling rates. 

Weighting the observations is the correct procedure if there is a
heteroskedasticity problem. There is no reason why the unequal sampling
proportions should cause a heteroskedasticity problem. Thus weighting the
observations is clearly not a correct solution. If our interest is mainly in
examining which variables are significant, we need not make any changes in the
estimated coefficients for the logit model. On the other hand, if the estimated
model is going to be used for prediction purposes, an adjustment in the constant
term, as suggested earlier, is necessary.

22 Robert B. Avery and Gerald A. Hanweck, “A Dynamic Analysis of Bank Failures,”
Bank Structure and Competition (Chicago: Federal Reserve Bank of Chicago, 1984),
pp. 380-395.

23 For a discussion of this point, see Maddala, Limited-Dependent, pp. 90-91. On
p. 91 there is an error. The constant term should be decreased (not increased).

Prediction of Effects of Changes in the 
Explanatory Variables 

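After estimating the parameters $\beta_j$, we would like to know the effects of
changes in any of the explanatory variables on the probabilities of any
observation belonging to either of the two groups. These effects are given by

$$\frac{\partial P_i}{\partial x_{ij}} = \begin{cases} \beta_j & \text{for the linear probability model} \\ \beta_j P_i (1 - P_i) & \text{for the logit model} \\ \beta_j \phi(Z_i) & \text{for the probit model} \end{cases}$$

where

$$Z_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}$$

and $\phi(\cdot)$ is the density function of the standard normal.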
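As a rough illustration of these derivative formulas, the sketch below evaluates the effect of each explanatory variable at a given point for the three models. The function names and the example coefficients are hypothetical; only the three formulas themselves come from the text.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def logistic_cdf(z):
    """Logistic distribution function, P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def marginal_effects(model, beta0, betas, x):
    """Derivatives of P_i with respect to each x_ij, evaluated at x."""
    z = beta0 + sum(b * v for b, v in zip(betas, x))
    if model == "lpm":
        factor = 1.0                      # constant derivative beta_j
    elif model == "logit":
        p = logistic_cdf(z)
        factor = p * (1.0 - p)            # beta_j * P_i * (1 - P_i)
    elif model == "probit":
        factor = normal_pdf(z)            # beta_j * phi(Z_i)
    else:
        raise ValueError("model must be 'lpm', 'logit', or 'probit'")
    return [b * factor for b in betas]


# Hypothetical coefficients; x is chosen so that the index Z_i equals zero.
betas = [2.0, -1.0]
x = [0.5, 1.0]
for model in ("lpm", "logit", "probit"):
    print(model, marginal_effects(model, 0.0, betas, x))
```

At $Z_i = 0$ the logit factor is 0.25 and the probit factor is about 0.399; both shrink as $Z_i$ moves away from zero, which is why the effects need to be evaluated at several levels of the explanatory variables.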

In the case of the linear probability model these derivatives are constant. In
the case of the logit and probit models, we need to calculate them at different
levels of the explanatory variables to get an idea of the range of variation of
the resulting changes in probabilities. If one is interested in the prediction of
the effect on the log-odds ratio, then for the logit model, this effect is constant
since

$$\frac{\partial}{\partial x_{ij}} \log \frac{P_i}{1 - P_i} = \beta_j$$



Measuring Goodness of Fit 

There is a problem with the use of conventional $R^2$-type measures when the
explained variable $y$ takes on only two values. 24 The predicted values
$\hat y$ are probabilities and the actual values $y$ are either 0 or 1. For the
linear probability model and the logit model we have
$\sum y_i = \sum \hat y_i$, as with the linear regression model, if a constant
term is also estimated. For the probit model there is no such exact relationship
although it is approximately valid. We will see this in the illustrative example
presented later.

There are several $R^2$-type measures that have been suggested for models with
qualitative dependent variables. The following are some of them. In the case
of the linear regression model, they are all equivalent. However, they are not
equivalent in the case of models with qualitative dependent variables.


24 These are summarized in Maddala, Limited-Dependent, pp. 37-41.

1. $R^2$ = squared correlation between $y$ and $\hat y$.

2. Measures based on residual sum of squares. For the linear regression
model we have

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat y_i)^2}{\sum_{i=1}^{n} (y_i - \bar y)^2}$$

We can use this same measure if we can use $\sum_{i=1}^{n} (y_i - \hat y_i)^2$
as the measure of residual sum of squares. Efron 25 argued that we can use it.

Note that in the case of a binary dependent variable

$$\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} y_i^2 - n \bar y^2 = n_1 - n \left( \frac{n_1}{n} \right)^2 = \frac{n_1 n_2}{n}$$

where $n_1$ and $n_2$ are the numbers of observations with $y_i = 1$ and
$y_i = 0$, respectively. Hence Efron's measure of $R^2$ is

$$R^2 = 1 - \frac{n}{n_1 n_2} \sum_{i=1}^{n} (y_i - \hat y_i)^2$$

Amemiya 26 argues that it makes more sense to define the residual sum of
squares as

$$\sum_{i=1}^{n} \frac{(y_i - \hat y_i)^2}{\hat y_i (1 - \hat y_i)}$$

that is, to weight the squared error $(y_i - \hat y_i)^2$ by a weight that is
inversely proportional to its variance.

3. Measures based on likelihood ratios. For the standard linear regression
model,

$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + u \qquad u \sim IN(0, \sigma^2)$$

let $L_{UR}$ be the maximum of the likelihood function when maximized with
respect to all the parameters and $L_R$ be the maximum when maximized with the
restriction $\beta_i = 0$ for $i = 1, 2, \ldots, k$. Then

$$R^2 = 1 - \left( \frac{L_R}{L_{UR}} \right)^{2/n}$$

One can use an analogous measure for the logit and probit model as well.
However, for the qualitative dependent variable model, the likelihood function
(8.15) attains an absolute maximum of 1. This means that

$$L_R \le L_{UR} \le 1$$

or

$$L_R \le \frac{L_R}{L_{UR}} \le 1$$

or

$$L_R^{2/n} \le 1 - R^2 \le 1$$

or

$$0 \le R^2 \le 1 - L_R^{2/n}$$

Hence Cragg and Uhler 27 suggest a pseudo $R^2$ (it lies in [0, 1]):

$$\text{pseudo } R^2 = \frac{L_{UR}^{2/n} - L_R^{2/n}}{(1 - L_R^{2/n}) \, L_{UR}^{2/n}}$$

Another measure of $R^2$ is that of McFadden, 28 who defines it as

$$\text{McFadden's } R^2 = 1 - \frac{\log L_{UR}}{\log L_R}$$

However, this measure does not correspond to any $R^2$ measure in the linear
regression model.

25 B. Efron, “Regression and ANOVA with Zero-One Data: Measures of Residual
Variation,” Journal of the American Statistical Association, May 1978, pp. 113-121.

26 Amemiya, “Qualitative Response Models,” p. 1504.

27 J. G. Cragg and R. Uhler, “The Demand for Automobiles,” Canadian Journal of
Economics, 1970, pp. 386-406. See also Maddala, Limited-Dependent, pp. 39-40 for
a discussion of this pseudo-$R^2$.

28 D. McFadden, “The Measurement of Urban Travel Demand,” Journal of Public
Economics, 1974, pp. 303-328.

4. Finally, we can also think of $R^2$ in terms of the proportion of correct
predictions. Since the dependent variable is a zero or 1 variable, after we
compute the $\hat y_i$ we classify the $i$th observation as belonging to group 1
if $\hat y_i > 0.5$ and classify it as belonging to group 2 if $\hat y_i < 0.5$.
We can then count the number of correct predictions. We can define a predicted
value $y_i^*$, which is also a zero-one variable such that

$$y_i^* = \begin{cases} 1 & \text{if } \hat y_i > 0.5 \\ 0 & \text{if } \hat y_i < 0.5 \end{cases}$$

(Provided that we calculate $\hat y_i$ to enough decimals, ties will be very
unlikely.) Now define

$$\text{count } R^2 = \frac{\text{number of correct predictions}}{\text{total number of observations}}$$

Although this is a useful measure worth reporting in all problems, it might not
have enough discriminatory power. In the illustrative example in the next
section, we found that the logit model predicted 41 of the total 44 cases
correctly, whereas the probit and linear probability model each predicted 40 of
the 44 correctly. However, the linear probability model had five observations
with $\hat y_i$ substantially greater than 1, thus outside the range (0, 1).
Also, this measure did not help us discriminate between the three models as
much as the other $R^2$ measures did. It is, however, possible that this measure
has better discriminatory power in other problems. In any case, it is a measure
worth reporting in every problem.
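The measures listed above can be computed directly from the observed zero-one values and the fitted probabilities. The sketch below is one possible arrangement, under two assumptions that are not in the text: the fitted values are maximum likelihood probabilities lying strictly between 0 and 1 (so the likelihood-based measures are defined), and the restricted model contains only a constant, so that its fitted probability is the sample proportion $n_1/n$. The function name and the toy data are hypothetical.

```python
import math


def fit_measures(y, p):
    """R^2-type measures for a zero-one variable y and fitted probabilities p."""
    n = len(y)
    n1 = sum(y)
    n2 = n - n1
    y_bar = n1 / n
    p_bar = sum(p) / n

    # 1. Squared correlation between y and the fitted probabilities.
    s_yp = sum((yi - y_bar) * (pi - p_bar) for yi, pi in zip(y, p))
    s_yy = sum((yi - y_bar) ** 2 for yi in y)        # equals n1 * n2 / n
    s_pp = sum((pi - p_bar) ** 2 for pi in p)
    squared_corr = s_yp ** 2 / (s_yy * s_pp)

    # 2. Efron's measure, based on the unweighted residual sum of squares.
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, p))
    efron = 1.0 - n * rss / (n1 * n2)

    # 3. Likelihood-based measures (log-likelihoods of the binary model).
    log_l_ur = sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
                   for yi, pi in zip(y, p))
    log_l_r = n1 * math.log(n1 / n) + n2 * math.log(n2 / n)
    lr_r2 = 1.0 - math.exp(2.0 * (log_l_r - log_l_ur) / n)
    cragg_uhler = lr_r2 / (1.0 - math.exp(2.0 * log_l_r / n))
    mcfadden = 1.0 - log_l_ur / log_l_r

    # 4. Proportion of correct predictions with a 0.5 cutoff.
    correct = sum(1 for yi, pi in zip(y, p) if (pi > 0.5) == (yi == 1))

    return {
        "squared_corr": squared_corr,
        "efron": efron,
        "cragg_uhler": cragg_uhler,
        "mcfadden": mcfadden,
        "count": correct / n,
    }


# Toy data, invented for illustration only.
measures = fit_measures([1, 1, 0, 0], [0.8, 0.6, 0.4, 0.2])
for name, value in measures.items():
    print(name, round(value, 4))
```

Even on this tiny example the measures disagree with one another, which is the point made in the text: they coincide only for the linear regression model.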


8.10 Illustrative Example 


Consider the data in Table 8.4. The data refer to a study on the deterrent effect
of capital punishment by McManus. 29 The data are cross-sectional data for 44
states in the United States in 1950. There are two dummy variables in the data.
$D_2$, which is a South-North dummy, is clearly an explanatory variable. But
$D_1$ can be both an explained and explanatory variable. If it is an explained
variable we would consider it as “a propensity to have capital punishment.”

Let us first consider the regression of $M$ on all the other variables. The
results are as follows (figures in parentheses are t-ratios, not standard
errors):

$$\hat M = \underset{(-0.82)}{-8.50} - \underset{(-1.38)}{3.696}\,PC - \underset{(-0.54)}{3.568}\,PX + \underset{(2.11)}{2.598}\,D_1 - \underset{(2.62)}{0.0187}\,T - \underset{(-2.3)}{4.095}\,Y + \underset{(1.82)}{0.400}\,LF + \underset{(1.17)}{6.444}\,NW + \underset{(1.93)}{2.541}\,D_2 \qquad R^2 = 0.7746$$

Some of the coefficients have signs opposite to those we would expect.

Let us now consider treating $D_1$ as an explained variable. We will consider
$T$, $Y$, $LF$, $NW$, and $D_2$ as the explanatory variables.

The linear probability model gave the following results (figures in parentheses
are t-ratios obtained from an ordinary regression program that ignores the
zero-one characteristic of the dependent variable):

$$\hat D_1 = \underset{(1.50)}{1.993} + \underset{(1.46)}{0.00146}\,T + \underset{(2.74)}{0.658}\,Y - \underset{(-1.93)}{0.055}\,LF + \underset{(2.62)}{1.988}\,NW + \underset{(1.81)}{0.343}\,D_2 \qquad R^2 = 0.3376$$

These results indicate that being a southern state and having a higher
percentage of nonwhites both have a positive effect on the probability of having
capital punishment. The percentage of labor force employed has a negative effect
on the probability of having capital punishment. What is perplexing is the
coefficient of $Y$ (median family income), which is significantly positive. One
possible explanation for this is that states with high incomes (New York,
California, etc.) also have big cities where crime rates are high.


29 Walter S. McManus, “Estimates of the Deterrent Effect of Capital Punishment:
The Importance of the Researcher’s Prior Beliefs,” Journal of Political Economy,
Vol. 93, April 1985, pp. 417-425.




Let us now look at the logit and probit estimates. 30 The logit model gave
(figures in parentheses are asymptotic t-ratios)

$$\hat D_1 = \underset{(0.53)}{10.99} + \underset{(1.87)}{0.0194}\,T + \underset{(1.88)}{10.61}\,Y - \underset{(-1.40)}{0.668}\,LF + \underset{(1.95)}{70.99}\,NW + \underset{(0.02)}{13.33}\,D_2$$

The results of the probit model were (figures in parentheses are asymptotic
t-ratios)

$$\hat D_1 = \underset{(0.61)}{6.92} + \underset{(2.00)}{0.0113}\,T + \underset{(2.05)}{6.46}\,Y - \underset{(-1.59)}{0.409}\,LF + \underset{(2.05)}{42.50}\,NW + \underset{(0.04)}{4.63}\,D_2$$

As mentioned earlier, the logit coefficients have to be divided by 1.6 to be
comparable to the probit coefficients. Such division produces the coefficients
6.87, 0.0121, 6.63, -0.418, 44.37, and 8.33, respectively, which are close to the
probit coefficients. Surprisingly, $D_2$ is not significant, but all the other
coefficients have the same signs as in the linear probability model. The
coefficient of $Y$ is still positive and is significant.
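The division by 1.6 can be reproduced from the reported estimates. In the short sketch below the coefficient values are copied from the two equations above; the dictionary layout and variable names are illustrative assumptions.

```python
# Logit and probit estimates as reported in the text.
logit = {"const": 10.99, "T": 0.0194, "Y": 10.61,
         "LF": -0.668, "NW": 70.99, "D2": 13.33}
probit = {"const": 6.92, "T": 0.0113, "Y": 6.46,
          "LF": -0.409, "NW": 42.50, "D2": 4.63}

# Put the logit coefficients on the probit scale.
logit_scaled = {name: value / 1.6 for name, value in logit.items()}

for name in logit:
    print(f"{name:5s} logit/1.6 = {logit_scaled[name]:9.4f}   "
          f"probit = {probit[name]:9.4f}")
```

The rescaled values agree with the probit estimates fairly closely for every variable except $D_2$, whose coefficient is estimated very imprecisely in both models.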


Table 8.4 Determinants of Murder Rates in the United States (Cross-Section Data
on States in 1950)^a

 N      M      PC     PX    D1     T     Y     LF     NW    D2
 1   19.25  0.204  0.035    1     47  1.10   51.2  0.321    1
 2    7.53  0.327  0.081    1     58  0.92   48.5  0.224    1
 3    5.66  0.401  0.012    1     82  1.72   50.8  0.127    0
 4    3.21  0.318  0.070    1    100  2.18   54.4  0.063    0
 5    2.80  0.350  0.062    1    222  1.75   52.4  0.021    0
 6    1.41  0.283  0.100    1    164  2.26   56.7  0.027    0
 7    6.18  0.204  0.050    1    161  2.07   54.6  0.139    1
 8   12.15  0.232  0.054    1     70  1.43   52.7  0.218    1
 9    1.34  0.199  0.086    1    219  1.92   52.3  0.008    0
10    3.71  0.138  0        0     81  1.82   53.0  0.012    0
11    5.35  0.142  0.018    1    209  2.34   55.4  0.076    0
12    4.72  0.118  0.045    1    182  2.12   53.5  0.299    0
13    3.81  0.207  0.040    1    185  1.81   51.6  0.040    0
14   10.44  0.189  0.045    1    104  1.35   48.5  0.069    1
15    9.58  0.124  0.125    1    126  1.26   49.3  0.330    1
16    1.02  0.210  0.060    1    192  2.07   53.9  0.017    0
17    7.52  0.227  0.055    1     95  2.04   55.7  0.166    1
18    1.31  0.167  0        0    245  1.55   51.2  0.003    0
19    1.67  0.120  0        0     97  1.89   54.0  0.010    0
20    7.07  0.139  0.041    1    177  1.68   52.2  0.076    0
21   11.79  0.272  0.063    1    125  0.76   51.1  0.454    1
22    2.71  0.125  0        0     56  1.96   54.0  0.032    0
23   13.21  0.235  0.086    1     85  1.29   55.0  0.266    1
24    3.48  0.108  0.040    1    199  1.81   52.9  0.018    0
25    0.81  0.672  0        0    298  1.72   53.7  0.038    0

30 The logit and probit estimates were computed using William Greene’s LIMDEP
program.

Table 8.4 (Cont.)

 N      M      PC     PX    D1     T     Y     LF     NW    D2
26    2.32  0.357  0.030    1    145  2.39   55.8  0.067    0
27    3.47  0.592  0.029    1     78  1.68   50.4  0.075    0
28    8.31  0.225  0.400    1    144  2.29   58.8  0.064    0
29    1.57  0.267  0.126    1    178  2.34   54.5  0.065    0
30    4.13  0.164  0.122    1    146  2.21   53.5  0.065    0
31    3.84  0.128  0.091    1    132  1.42   48.8  0.090    1
32    1.83  0.287  0.075    1     98  1.97   54.5  0.016    0
33    3.54  0.210  0.069    1    120  2.12   52.1  0.061    0
34    1.11  0.342  0        0    148  1.90   56.0  0.019    0
35    8.90  0.133  0.216    1    123  1.15   56.2  0.389    1
36    1.27  0.241  0.100    1    282  1.70   53.3  0.037    0
37   15.26  0.167  0.038    1     79  1.24   50.9  0.161    1
38   11.15  0.252  0.040    1     34  1.55   53.2  0.127    1
39    1.74  0.418  0        0    104  2.04   51.7  0.017    0
40   11.98  0.282  0.032    1     91  1.59   54.3  0.222    1
41    3.04  0.194  0.086    1    199  2.07   53.7  0.026    0
42    0.85  0.378  0        0    101  2.00   54.7  0.012    0
43    2.83  0.757  0.033    1    109  1.84   47.0  0.057    1
44    2.89  0.356  0        0    117  2.04   56.9  0.022    0

^a N, observation number; M, murder rate per 100,000, FBI estimate 1950; PC,
(number of convictions/number of murders) in 1950; PX, average number of
executions during 1946-1950 divided by convictions in 1950; Y, median family
income of families in 1949 (thousands of dollars); LF, labor force participation
rate 1950 (expressed as a percent); NW, proportion of population that is
nonwhite in 1950; D2, dummy variable, 1 for southern states, 0 for others; D1,
dummy variable which is 1 if the state has capital punishment, 0 otherwise
(D1 = 1 if PX > 0, 0 otherwise); T, median time served in months of convicted
murderers released in 1951.


One other problem is that, as mentioned in Section 8.9, the coefficients of
the logit model should be approximately four times the coefficients of the linear
probability model, but the coefficients we have obtained are much higher than
that. One possible reason for this is the poor fit given by the linear probability
model. To investigate this we computed the different $R^2$ measures discussed
in the preceding section, and the $R^2$'s for the linear probability model
are considerably lower than those for the logit and probit models.
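To see how far the estimates are from the factor of four, one can compute the ratio of each logit slope to the corresponding linear probability slope. The coefficient values below are copied from the estimated equations reported earlier in this section; the variable names are illustrative.

```python
# Slope estimates reported for the capital punishment example.
lpm_slopes = {"T": 0.00146, "Y": 0.658, "LF": -0.055, "NW": 1.988, "D2": 0.343}
logit_slopes = {"T": 0.0194, "Y": 10.61, "LF": -0.668, "NW": 70.99, "D2": 13.33}

# Ratio of logit to linear probability coefficients; about 4 is expected.
ratios = {name: logit_slopes[name] / lpm_slopes[name] for name in lpm_slopes}

for name, ratio in ratios.items():
    print(f"{name:3s} logit / LPM = {ratio:6.2f}")
```

Every ratio is above 12, roughly three to ten times the approximate factor of four.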

In Table 8.5 we present four different $R^2$ measures. 31 The first two are
easy to compute and are reasonable measures of $R^2$. The measures suggested
by Cragg and Uhler and by McFadden both depend on the computation of $L_R$
and $L_{UR}$. The results indicate that there is not much to choose between the
logit and probit models and that both are better than the linear probability
model. From the practical point of view it appears that the squared correlation
between $D_1$ and $\hat D_1$ and Efron’s $R^2$ are sufficient for many problems.

Table 8.5 Different R^2 Measures for the Logit, Probit, and Linear Probability
Models

Measure                                         Logit    Probit   Linear Probability
Squared correlation between D1 and fitted D1    0.6117   0.6099   0.3376
Efron's R^2                                     0.6116   0.6095   0.3376
Cragg-Uhler's R^2                               0.7223   0.7258   0.5273
McFadden's R^2                                  0.6083   0.6124   0.4029

31 We did not compute Amemiya’s $R^2$. Although he has given an expression for
the residual sum of squares, he has not given an expression for the total sum of
squares (which should also be appropriately weighted). Using the unweighted
total sum of squares $\sum (y_i - \bar y)^2$ produces a negative $R^2$.

Since we decided on the probit and logit models and $D_2$ was not significant
in these models, we decided to drop that variable and reestimate the probit and
logit models. The revised estimates were (figures in parentheses are asymptotic
t-ratios)

Logit

$$\hat D_1 = \underset{(0.84)}{16.57} + \underset{(1.72)}{0.0165}\,T + \underset{(1.81)}{9.13}\,Y - \underset{(-1.49)}{0.715}\,LF + \underset{(2.38)}{85.36}\,NW$$

$R^2(D_1, \hat D_1) = 0.5982$        Efron’s $R^2$ = 0.5982
Cragg-Uhler’s $R^2$ = 0.7077         McFadden’s $R^2$ = 0.5914

Probit

$$\hat D_1 = \underset{(0.98)}{10.27} + \underset{(1.86)}{0.0094}\,T + \underset{(1.97)}{5.55}\,Y - \underset{(-1.7)}{0.437}\,LF + \underset{(2.50)}{50.25}\,NW$$

$R^2(D_1, \hat D_1) = 0.5950$        Efron’s $R^2$ = 0.5947
Cragg-Uhler’s $R^2$ = 0.7113         McFadden’s $R^2$ = 0.5955

Again, to make the logit coefficients comparable to the probit coefficients, we 
have to divide the former by 1.6. This gives 10.36, 0.0103, 5.71, -0.447, and 
53.35, respectively, which are close to the probit coefficients. 


8.11 Truncated Variables: The Tobit Model 


In our discussion of the logit and probit models we talked about a latent
variable $y_i^*$ which was not observed, for which we could specify the
regression model:

$$y_i^* = \beta x_i + u_i \qquad (8.18)$$

For simplicity of exposition we are assuming that there is only one explanatory
variable. In the logit and probit models, what we observe is a dummy variable

$$y_i = \begin{cases} 1 & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases}$$

Suppose, however, that $y_i^*$ is observed if $y_i^* > 0$ and is not observed
if $y_i^* \le 0$. Then the observed $y_i$ will be defined as

$$y_i = \begin{cases} y_i^* = \beta x_i + u_i & \text{if } y_i^* > 0 \\ 0 & \text{if } y_i^* \le 0 \end{cases} \qquad u_i \sim IN(0, \sigma^2) \qquad (8.19)$$

This is known as the tobit model (Tobin’s probit) and was first analyzed in the
econometrics literature by Tobin. 32 It is also known as a censored normal
regression model because some observations on $y^*$ (those for which
$y^* \le 0$) are censored (we are not allowed to see them). Our objective is to
estimate the parameters $\beta$ and $\sigma$.


Some Examples 


The example that Tobin considered was that of automobile expenditures. Suppose
that we have data on a sample of households. We wish to estimate, say,
the income elasticity of demand for automobiles. Let $y^*$ denote expenditures
on automobiles and $x$ denote income, and we postulate the regression equation

$$y_i^* = \beta x_i + u_i \qquad u_i \sim IN(0, \sigma^2)$$

However, in the sample we would have a large number of observations for
which the expenditures on automobiles are zero. Tobin argued that we should
use the censored regression model. We can specify the model as

$$y_i = \begin{cases} \beta x_i + u_i & \text{for those with positive automobile expenditures} \\ 0 & \text{for those with no expenditures} \end{cases} \qquad (8.20)$$

The structure of this model thus appears to be the same as that in (8.19).

There have been a very large number of applications of the tobit model. 33
Take, for instance, hours worked ($H$) or wages ($W$). If we have observations
on a number of individuals, some of whom are employed and others not, we
can specify the model for hours worked as

$$H_i = \begin{cases} \beta x_i + u_i & \text{for those working} \\ 0 & \text{for those who are not working} \end{cases} \qquad (8.21)$$

Similarly, for wages we can specify the model

$$W_i = \begin{cases} \gamma z_i + v_i & \text{for those working} \\ 0 & \text{for those who are not working} \end{cases} \qquad (8.22)$$


32 J. Tobin, “Estimation of Relationships for Limited Dependent Variables,” Econometrica, Vol. 
26, 1958, pp. 24-36. 

33 See T. Amemiya, “Tobit Models: A Survey,” Journal of Econometrics, Vol. 24,
1984, pp. 3-61, which lists a large number of applications of the tobit model.

The structure of these models again appears to be the same as in (8.19).
However, there are some limitations in the formulation of the models in
(8.20)-(8.22) that we will presently outline after discussing the estimation of
the model in (8.19).

Method of Estimation 

Let us consider the estimation of $\beta$ and $\sigma$ by the use of ordinary
least squares (OLS). We cannot use OLS using the positive observations $y_i$
because when we write the model

$$y_i = \beta x_i + u_i$$

the error term $u_i$ does not have a zero mean. Since observations with
$y_i^* \le 0$ are omitted, it implies that only observations for which
$u_i > -\beta x_i$ are included in the sample. Thus the distribution of $u_i$ is
a truncated normal distribution shown in Figure 8.4 and its mean is not zero. In
fact, it depends on $\beta$, $\sigma$, and $x_i$ and is thus different for each
observation. A method of estimation commonly suggested is the maximum likelihood
(ML) method, which is as follows:

Note that we have two sets of observations:

1. The positive values of $y_i$, for which we can write down the normal density
function as usual. We note that $(y_i - \beta x_i)/\sigma$ has a standard
normal distribution.

2. The zero observations of $y_i$, for which all we know is that $y_i^* \le 0$
or $\beta x_i + u_i \le 0$. Since $u_i/\sigma$ has a standard normal
distribution, we will write this as $u_i/\sigma \le -(\beta x_i)/\sigma$. The
probability of this can be written as $F(-(\beta x_i)/\sigma)$, where $F(z)$ is
the cumulative distribution function of the standard normal.

Let us denote the density function of the standard normal by $f(\cdot)$ and the
cumulative distribution function by $F(\cdot)$. Thus

$$f(t) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{t^2}{2} \right)$$

and

$$F(z) = \int_{-\infty}^{z} f(t)\,dt$$

Using this notation we can write the likelihood function for the tobit model as

$$L = \prod_{y_i > 0} \frac{1}{\sigma} f\left( \frac{y_i - \beta x_i}{\sigma} \right) \prod_{y_i \le 0} F\left( -\frac{\beta x_i}{\sigma} \right)$$

Maximizing this likelihood function with respect to $\beta$ and $\sigma$, we get
the maximum likelihood (ML) estimates of these parameters. We will not go through
the algebraic details of the ML method here. 34 Instead, we discuss the
situations under which the tobit model is applicable and its relationship to
other models with truncated variables.
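The likelihood above translates almost line by line into code. The sketch below evaluates the tobit log-likelihood for a single explanatory variable using only the Python standard library. It also includes the mean of the truncated error, $E(u_i \mid u_i > -\beta x_i) = \sigma f(\beta x_i/\sigma)/F(\beta x_i/\sigma)$, which is a standard property of the truncated normal and is not derived in the text. The function names and the small data set are hypothetical, and the sketch does not perform the maximization itself.

```python
import math


def f(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def F(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(beta, sigma, xs, ys):
    """Log of the tobit likelihood for y_i = max(0, beta * x_i + u_i)."""
    total = 0.0
    for x, y in zip(xs, ys):
        if y > 0:
            # Uncensored observation: normal density of y_i.
            total += math.log(f((y - beta * x) / sigma)) - math.log(sigma)
        else:
            # Censored observation: probability that y_i* <= 0.
            total += math.log(F(-beta * x / sigma))
    return total


def truncated_error_mean(beta, sigma, x):
    """E(u | u > -beta * x): nonzero, and different for each x."""
    z = beta * x / sigma
    return sigma * f(z) / F(z)


# Hypothetical data: three positive observations and two zeros.
xs = [1.0, 2.0, 3.0, 0.5, 0.2]
ys = [1.2, 1.9, 3.3, 0.0, 0.0]
print(tobit_loglik(1.0, 1.0, xs, ys))
print([round(truncated_error_mean(1.0, 1.0, x), 3) for x in xs])
```

In practice the log-likelihood would be handed to a numerical optimizer; for extreme arguments `F` can underflow to zero, so a production version would need a more careful evaluation of its logarithm.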


Limitations of the Tobit Model 

Consider the models of automobile expenditures in (8.20), of hours worked in 
(8.21), and of wages in (8.22). In each case there are zero observations on some 
individuals in the sample and thus the structure of the model looks very similar 
to that in (8.19). But is it really? Every time we have some zero observations 
in the sample, it is tempting to use the tobit model. However, it is important to 
understand what the model in (8.19) really says. What we have in model (8.19) 
is a situation where $y_i^*$ can, in principle, take on negative values. However, we
do not observe them because of censoring. Thus the zero values are due to 
nonobservability. This is not the case with automobile expenditures, hours 
worked, or wages. These variables cannot, in principle, assume negative val¬ 
ues. The observed zero values are due not to censoring, but due to the deci¬ 
sions of individuals. In this case the appropriate procedure would be to model 
the decisions that produce the zero observations rather than use the tobit model 
mechanically. 

Consider, for instance, the model of wages in (8.22). We can argue that each
person has a reservation wage $W_1$ below which the person would not want to
work. If $W_2$ is the market wage for this person (i.e., the wage that employers
are willing to pay) and $W_2 > W_1$, then we will observe the person as working
and the observed wage $W$ is equal to $W_2$. On the other hand, if $W_1 > W_2$,
we observe the person as not working and the observed wage is zero.

If this is the story behind the observed zero wages (one can construct other
similar stories), we can formulate the model as follows. Let the reservation
wages $W_{1i}$ and market wages $W_{2i}$ be given by

$$W_{1i} = \beta_1 x_{1i} + u_{1i}$$
$$W_{2i} = \beta_2 x_{2i} + u_{2i}$$


34 For details see Maddala, Limited-Dependent, Chap. 6. A convenient computer
program is the LIMDEP program by William Greene at New York University. This
program can also be used to estimate the other models discussed later.

The observed $W_i$ is given by

$$W_i = \begin{cases} W_{2i} & \text{if } W_{2i} \ge W_{1i} \\ 0 & \text{otherwise} \end{cases}$$

We can write this as

$$W_i = \begin{cases} \beta_2 x_{2i} + u_{2i} & \text{if } u_{2i} - u_{1i} \ge \beta_1 x_{1i} - \beta_2 x_{2i} \\ 0 & \text{otherwise} \end{cases} \qquad (8.23)$$

Note the difference between this formulation and the one in equation (8.19).
The criterion that $W_i = 0$ is not given by $u_{2i} \le -\beta_2 x_{2i}$ as in
the simple tobit model but by $u_{2i} - u_{1i} < \beta_1 x_{1i} - \beta_2 x_{2i}$.
Hence estimation of a simple tobit model in this case produces inconsistent
estimates of the parameters.

Estimation of the model given by (8.23) is too complicated to be discussed
here. 35 However, the purpose of the example is to show that we should not use
the tobit model every time we have some zero observations. In fact,
we can construct similar models for automobile expenditures and hours worked
wherein the zero observations are a consequence of decisions by individuals.
The simple censored regression model (or the tobit model) is applicable only in
those cases where the latent variable can, in principle, take negative values and
the observed zero values are a consequence of censoring and nonobservability.
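The difference between the two zero-generating rules can be made concrete with a small sketch. The function names and the wage figures are hypothetical; the two rules are the ones contrasted in the text, the reservation-wage rule of (8.23) and the censoring rule of the simple tobit model in (8.19).

```python
def observed_wage(reservation_wage, market_wage):
    """Rule (8.23): the market wage is observed only if it is at least
    as large as the reservation wage; otherwise the wage is zero."""
    return market_wage if market_wage >= reservation_wage else 0.0


def tobit_observed(latent_value):
    """Rule (8.19): the latent value is observed only if it is positive."""
    return latent_value if latent_value > 0 else 0.0


# A person offered 10 who will not work for less than 12.
market, reservation = 10.0, 12.0

print(observed_wage(reservation, market))   # zero: the person chooses not to work
print(tobit_observed(market))               # nonzero: a positive value is never censored
```

The same positive market wage gives a zero under the decision rule and a nonzero value under the censoring rule, so treating the zeros as censored misdescribes how they arise.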


The Truncated Regression Model 

The term truncated regression model refers to a situation where samples are 
drawn from a truncated distribution. It is important to keep in mind the dis¬ 
tinction between this model and the tobit model. In the censored regression 
model (tobit model) we have observations on the explanatory variable x, for all 
individuals. It is only the dependent variable y* that is missing for some indi¬ 
viduals. In the truncated regression model, we have no data on either y* or x, 
for some individuals because no samples are drawn if $y_i^*$ is below or above a
certain level. The New Jersey negative-income-tax experiment was an example 
in which high-income families were not sampled at all. Families were selected 
at random but those with incomes higher than 1.5 times the 1967 poverty line 
were eliminated from the study. If we are estimating earnings functions from 
these data, we cannot use the OLS method. We have to take account of the 
fact that the sample is from a truncated distribution. Figure 8.5 illustrates this 
point. The observations above y = L are omitted from the sample. If we use 
OLS, the estimated line gives a biased estimate of the true slope. A method of 
estimation often suggested is, again, the maximum likelihood method. 36 

Suppose that the untruncated regression model is

$$y_i^* = \beta x_i + u_i \qquad u_i \sim IN(0, \sigma^2)$$


35 It is discussed in Maddala, Limited-Dependent, Sec. 6.11.
36 Maddala, Limited-Dependent, Chap. 6.

Figure 8.5. Truncated regression model.


Now, only observations with y* < L are included in the sample. The total area 
under the normal curve up to y* < L, that is, uja < (L - (3 x,)/a is F[(L — 
0x,)/ct], where F(-) is the cumulative distribution function of the standard nor¬ 
mal. The density function of the observed y, is the standard normal density 
except that its total area is F[{L — fix, )/a]. Since the total area under a prob¬ 
ability density function should be equal to 1, we have to normalize by this 
factor. Thus the density function of the observations y, from the truncated nor¬ 
mal is 


8(y) = 



10 




L - fixj 


ify, 

otherwise 


The log-likelihood is, therefore (apart from a constant),

$$\log L = -n \log \sigma - \frac{1}{2\sigma^2} \sum_i (y_i - \beta x_i)^2 - \sum_i \log F\left(\frac{L - \beta x_i}{\sigma}\right)$$

Maximizing this with respect to $\beta$ and $\sigma$, we get the ML estimates of $\beta$ and $\sigma$.
Again, we need not be concerned with the details of the ML estimation method.
As an illustration of how different the OLS and ML estimates can be, we present
the estimates obtained by Hausman and Wise 37 in Table 8.6. The dependent
variable was log earnings. The results show how the OLS estimates are
biased. In this particular case they are all biased toward zero.
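The bias of OLS under truncation can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions, not the procedure used by Hausman and Wise: it assumes a single regressor with no intercept, $x$ uniform on (0, 4), true values $\beta = 1$ and $\sigma = 1$, a truncation point $L = 2$, and a crude grid search in place of a proper optimizer. All function names are hypothetical.

```python
import math
import random


def norm_logpdf(z):
    """Log of the standard normal density f(z)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z


def norm_logcdf(z):
    """Log of the standard normal CDF F(z); erfc keeps the lower tail accurate."""
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def truncated_loglik(beta, sigma, xs, ys, L):
    """Log-likelihood when y = beta*x + u is sampled only if y < L."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += (-math.log(sigma)
                  + norm_logpdf((y - beta * x) / sigma)
                  - norm_logcdf((L - beta * x) / sigma))
    return total


def simulate_truncated(n_latent, beta, sigma, L, seed=0):
    """Draw latent observations and keep only those with y < L (assumed design)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_latent):
        x = rng.uniform(0.0, 4.0)
        y = beta * x + rng.gauss(0.0, sigma)
        if y < L:
            xs.append(x)
            ys.append(y)
    return xs, ys


def ols_through_origin(xs, ys):
    """OLS slope for the no-intercept model y = beta*x + u."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def grid_ml(xs, ys, L):
    """Crude ML: maximize the log-likelihood over a (beta, sigma) grid."""
    grid = [0.6 + 0.05 * k for k in range(17)]  # 0.60, 0.65, ..., 1.40
    best = max((truncated_loglik(b, s, xs, ys, L), b, s)
               for b in grid for s in grid)
    return best[1], best[2]


xs, ys = simulate_truncated(6000, beta=1.0, sigma=1.0, L=2.0)
beta_ols = ols_through_origin(xs, ys)
beta_ml, sigma_ml = grid_ml(xs, ys, L=2.0)
print(f"retained observations: {len(xs)}")
print(f"OLS slope: {beta_ols:.3f}  grid-ML slope: {beta_ml:.2f}  grid-ML sigma: {sigma_ml:.2f}")
```

In this design the OLS slope should fall well below the true value of 1, while the grid maximum of the truncated likelihood should lie close to it. The grid search is chosen only to keep the sketch dependency-free; in practice a numerical optimizer would be used.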


37 J. A. Hausman and D. A. Wise, “Social Experimentation, Truncated Distributions, and
Efficient Estimation,” Econometrica, Vol. 45, 1977, pp. 319-339.

Table 8.6 Earnings Equations Estimated from the New Jersey
Negative-Income-Tax Experiment (a)

Variable          OLS         ML
Constant          8.203       9.102
                 (0.091)     (0.026)
Education         0.010       0.015
                 (0.006)     (0.007)
IQ                0.002       0.006
                 (0.002)     (0.005)
Training          0.002       0.006
                 (0.001)     (0.003)
Union             0.090       0.246
                 (0.030)     (0.089)
Illness          -0.076      -0.226
                 (0.038)     (0.107)
Age (linear)     -0.003      -0.016
                 (0.002)     (0.005)

(a) Figures in parentheses are standard errors.


Summary 


1. In this chapter we discussed 

(a) Dummy explanatory variables. 

(b) Dummy dependent variables. 

(c) Truncated dependent variables. 

2. Dummy explanatory variables can be used in tests for coefficient stability 
in the linear regression models, for obtaining predictions in the linear regression 
models, and for imposing cross-equation constraints. These uses have been 
illustrated with examples. 

3. One should exercise caution in using the dummy variables when there is 
heteroskedasticity or autocorrelation. In the presence of autocorrelated errors, 
the dummy variables for testing stability have to be defined suitably. The 
proper definitions are given in Section 8.6. 

4. Regarding the dummy dependent variables, there are three different 
models that one can use: the linear probability model, the logit model, and the 
probit model. The linear discriminant function is closely related to the linear 
probability model. The coefficients of the discriminant function are just
proportional to those of the linear probability model (see Section 8.8). Thus there
is nothing new in linear discriminant analysis. The linear probability model has
the drawback that the predicted values can be outside the permissible interval
(0, 1).

5. In the analysis of models with dummy dependent variables, we assume
the existence of a latent (unobserved) continuous variable which is specified as
the usual regression model. However, the latent variable can be observed only 
as a dichotomous variable. The difference between the logit and probit models 
is in the assumptions made about the error term. If the error term has a logistic 
distribution, we have the logit model. If it has a normal distribution, we have 
the probit model. From the practical point of view, there is not much to choose 
between the two. The results are usually very similar. If both the models are 
computed, one should make some adjustments in the coefficients to make them 
comparable. These adjustments have been outlined in Section 8.9. 

6. For comparing the linear probability, logit, and probit models, one can 
look at the number of cases correctly predicted. However, this is not enough. 
It is better to look at some measures of $R^2$. In Section 8.9 we discuss several
measures of $R^2$: the squared correlation between $y$ and $\hat{y}$, Efron’s $R^2$, Cragg and
Uhler’s $R^2$, and McFadden’s $R^2$. For practical purposes the first two are
descriptive enough. The computation of the different $R^2$’s is illustrated with an
example in Section 8.10.

7. The tobit model is a censored regression model. Observations on the latent
variable $y^*$ are missing (or censored) if $y^*$ is below (or above) a certain
threshold level. This model has been used in a large number of applications 
where the dependent variable is observed to be zero for some individuals in 
the sample (automobile expenditures, medical expenditures, hours worked, 
wages, etc.). However, on careful scrutiny we find that the censored regression 
model (tobit model) is inappropriate for the analysis of these problems. The 
tobit model is, strictly speaking, applicable in only those situations where the 
latent variable can, in principle, take negative values, but these negative values 
are not observed because of censoring. Where the zero observations are a
consequence of individual decisions, these decisions should be modeled
appropriately and the tobit model should not be used mechanically.

8. Sometimes samples are drawn from truncated distributions. In this case 
the truncated regression model should be used. This model is different from 
the censored regression model (tobit model). 

9. The LIMDEP program can be used to compute the logit, probit, tobit, 
truncated regression, and related models discussed here. 


Exercises 


1. Explain the meaning of each of the following terms. 

(a) Seasonal dummy variables. 

(b) Dummy dependent variables. 

(c) Linear probability model. 

(d) Linear discriminant function. 

(e) Logit model. 

(f) Probit model. 

(g) Tobit model. 

(h) Truncated regression model.

2. What would be your answer to the following queries?

(a) My regression program refuses to estimate four seasonal coefficients 
when I enter the quarterly data including a zero-one dummy for each 
quarter. What am I supposed to do? 

(b) I estimated a model with a zero-one dependent variable using the 
logit and probit programs. The coefficients I got from the probit program
were all smaller than the corresponding coefficients estimated
by the logit program. Is there something wrong with my programs?

(c) I have data on medical expenditures on a sample of individuals. 
Some of them, who did not have any ailments, or did not bother to 
go to the doctor even if they had ailments, had no expenditures. I 
wish to estimate the income elasticity of medical expenditures. I am 
thinking of dropping the individuals with zero expenditures and estimating
the model by OLS. My friend says that I would be overestimating
the income elasticity by doing this. Is she correct?

3. Explain how you would use dummy variables for generating predictions 
from a regression model. 

4. In the model 


$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t} + u_t$

the coefficients are known to be related to a more basic economic parameter
$\alpha$ according to the equations

$\beta_1 + \beta_2 = \alpha$

$\beta_1 + \beta_3 =$

Explain how you would estimate $\alpha$ and the variance of $\hat{\alpha}$.

5. In the model 


$y_{1t} = \alpha x_{1t} + \beta x_{2t} + u_{1t}$
$y_{2t} = \alpha x_{2t} + u_{2t}$
$y_{3t} = \beta x_{1t} + u_{3t}$

where $u_{1t} \sim IN(0, 2\sigma^2)$, $u_{2t} \sim IN(0, \sigma^2)$, $u_{3t} \sim IN(0, \sigma^2)$, and $u_{1t}$, $u_{2t}$, $u_{3t}$ are
mutually independent, explain how you will estimate $\alpha$, $\beta$, and $\sigma^2$.

6. The following equation was estimated to explain a short-term interest rate: 
(Figures in parentheses are standard errors.) 

$Y_t = 5.5 + 0.93x_t - 0.38x_{t-1} - 5.2(P_t/P_{t-4}) + 0.50Y_{t-1}$
      (1.3)  (0.04)     (0.09)        (1.3)              (0.07)

      $- 0.05(D_1 - D_4) + 0.08(D_2 - D_4) + 0.06(D_3 - D_4)$
        (0.04)             (0.04)             (0.04)

$R^2 = 0.90 \quad \bar{R}^2 = 0.89 \quad SEE = 0.19 \quad DW = 1.3 \quad T = 92$

where $Y$ = interest rate on 4- to 6-month commercial paper (percent)
      $x$ = interest rate on 90-day Treasury bills (percent)
      $D_1$ = seasonal dummy = 1 for first quarter, 0 otherwise
      $D_2$ = seasonal dummy = 1 for second quarter, 0 otherwise, etc.
      SEE = $\hat{\sigma}$

(a) What is the estimated seasonal pattern in the commercial paper rate?

(b) About how much would $R^2$ drop if $P_t/P_{t-4}$ were dropped from the
equation?

(c) Will $\bar{R}^2$ increase or decrease if $P_t/P_{t-4}$ is dropped?

(d) Instead of using $P_t/P_{t-4}$, suppose that we use the percentage rate of
inflation $\Pi_t$ defined by

$\Pi_t = 100 \left( \frac{P_t - P_{t-4}}{P_{t-4}} \right)$

What will be the new coefficients and their standard errors?

7. You are asked to estimate the coefficients in a linear discriminant function. 
You do not have a computer program for this. All you have is a program 
for multiple regression analysis. How will you compute the linear
discriminant function?

8. Consider a model with a zero-one dependent variable. You have a multiple 
regression program and a program for the logit and probit models. You have 
computed the coefficients of the linear probability model and the logit and 
probit models: 

(a) How will you transform the coefficients of the three models so that 
they are comparable? 

(b) How will you compute the $R^2$'s for the three models?

(c) By what criteria will you choose the best model? 

9. Explain how you will formulate a model explaining the following. In each 
case the sample consists of some observations for which the dependent 
variable is zero. Suggest a list of explanatory variables in each case. 

(a) Automobile expenditures (in a year) of a number of families. 

(b) Hours worked by a group of married women. 

(c) Amount of child support received by a number of working wives. 

(d) Medical expenditures of a number of families. 

(e) Amount of loan granted by a bank to a number of applicants. 

(f) Amount of financial aid received by a group of students at a college. 

10. Table 8.7 presents data on bride and bridegroom characteristics and dowries
for marriages in rural south-central India. 38 The variable definitions
follow the table:

(a) Estimate an equation explaining the determinants of the dowry. 

(b) Estimate probit and tobit equations explaining the determinants of 
bride’s years of schooling and groom’s years of schooling. 

38 The data have kindly been provided by Anil B. Deolalikar. They have been analyzed in A. B.
Deolalikar and V. Rao, “The Demand for Dowries and Bride Characteristics in Marriage:
Empirical Estimates for Rural South Central India.” Manuscript, University of Washington,
September 1990.

dummy (husband’s father had some primary education)
dummy (husband’s father completed primary education)
dummy (husband’s father completed middle or
secondary school education)

9 SIMULTANEOUS EQUATIONS MODELS


9.1 Introduction 


In the usual regression model $y$ is the dependent or determined variable and
$x_1, x_2, \ldots$ are the independent or determining variables. The crucial assumption
we make is that the $x$'s are independent of the error term $u$. Sometimes, this
assumption is violated: for example, in demand and supply models. It is also
violated in the case where the $x$'s are measured with error, but we postpone
discussion of this case to Chapter 11. We illustrate here the case of a demand
and supply model.

Suppose that we write the demand function as 

$q = \alpha + \beta p + u$          (9.1)

where q is the quantity demanded, p the price, and u the disturbance term, 
which denotes random shifts in the demand function. In Figure 9.1 we see that 
a shift in the demand function produces a change in both price and quantity if 
the supply curve has an upward slope. If the supply curve is horizontal (i.e.,
completely price inelastic), a shift in the demand curve produces a change in
price only, and if the supply curve is vertical (infinite price elasticity), a shift
in the demand curve produces a change in quantity only.

[Figure 9.1 panels: supply curve sloping upward; supply curve horizontal; supply curve vertical]

Figure 9.1 Effects of shifts in the demand function on price.

Thus in equation (9.1) the error term u is correlated with p when the supply 
curve is upward sloping or perfectly horizontal. Hence an estimation of the 
equation by ordinary least squares produces inconsistent estimates of the
parameters.
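The correlation between $p$ and $u$ can be made explicit with a short derivation. The supply function used here, $q = \gamma + \delta p + v$ with $v$ independent of $u$, is an assumption introduced only for this illustration (the symbols $\gamma$, $\delta$, $v$, and $\sigma_u^2$ do not appear in the text at this point). Equating demand (9.1) and supply and solving for price:

```latex
\begin{aligned}
\alpha + \beta p + u &= \gamma + \delta p + v \\
p &= \frac{\alpha - \gamma + u - v}{\delta - \beta} \\
\operatorname{cov}(p, u) &= \frac{\sigma_u^2}{\delta - \beta}
\end{aligned}
```

Under these assumptions the covariance is nonzero for any finite $\delta$, including $\delta = 0$ (quantity supplied completely unresponsive to price), and it tends to zero only as $\delta \to \infty$ (supply infinitely price elastic). This matches the statement that $u$ is correlated with $p$ except when the supply curve is infinitely elastic.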

We could have written the demand function as 

$p = \alpha' + \beta' q + u'$          (9.1')

But, again, u' will be correlated with q if the supply function is upward sloping 
or perfectly vertical. There is the question of whether the demand function 
should be written as in equation (9.1) or (9.1'). If it is written as in (9.1), we 
say that the equation is normalized with respect to q (i.e., the coefficient of q 
is unity). If it is written as in (9.1'), we say that the equation is normalized with 
respect to p. Since price and quantity are determined by the interaction of 
demand and supply, it should not matter whether we normalize the equations 
with respect to p or with respect to q. 

The estimation methods we use should produce identical estimates whatever
normalization we adopt. However, the preceding discussion and Figure 9.1
suggest that if quantity supplied is not responsive to price, the demand function
should be normalized with respect to $p$, that is, it should be written as in (9.1').
On the other hand, if quantity supplied is highly responsive to price, the
demand function should be normalized with respect to $q$ as in (9.1). An empirical
example will be given in Section 9.3 to illustrate this point. The question of
normalization is discussed in greater detail in Section 9.7.

Returning to the demand and supply model, the problem is that we cannot 
consider the demand function in isolation when we are studying the relationship 
between quantity and price as in (9.1). The solution is to bring the supply
function into the picture, and estimate the demand and supply functions together.
Such models are known as simultaneous equations models. 

9.2 Endogenous and Exogenous Variables 


In simultaneous equations models variables are classified as endogenous and 
exogenous. The traditional definition of these terms is that endogenous
variables are variables that are determined by the economic model and exogenous
variables are those determined from outside. 

Endogenous variables are also called jointly determined and exogenous
variables are called predetermined. (It is customary to include past values of
endogenous variables in the predetermined group.) Since the exogenous variables
are predetermined, they are independent of the error terms in the model. They 
thus satisfy the assumptions that the x's satisfy in the usual regression model 
of y on x's. 

The foregoing definition of exogeneity has been questioned in recent
econometric literature, but we have to postpone this discussion to a later
section (Section 9.10). We will discuss this criticism and its implications after going
through the conventional methods of estimation for simultaneous equations
models.

Consider now the demand and supply model 

$q = a_1 + b_1 p + c_1 y + u_1$    demand function
$q = a_2 + b_2 p + c_2 R + u_2$    supply function          (9.2)

$q$ is the quantity, $p$ the price, $y$ the income, $R$ the rainfall, and $u_1$ and $u_2$ are the
error terms. Here $p$ and $q$ are the endogenous variables and $y$ and $R$ are the
exogenous variables. Since the exogenous variables are independent of the error
terms $u_1$ and $u_2$ and satisfy the usual requirements for ordinary least squares
estimation, we can estimate regressions of p and q on y and R by ordinary least 
squares, although we cannot estimate equations (9.2) by ordinary least squares. 
We will show presently that from these regressions of p and q on y and R we 
can recover the parameters in the original demand and supply equations (9.2). 
This method is called indirect least squares —it is indirect because we do not 
apply least squares to equations (9.2). The indirect least squares method does 
not always work, so we will first discuss the conditions under which it works 
and how the method can be simplified. To discuss this issue, we first have to 
clarify the concept of identification. 


9.3 The Identification Problem: 

Identification Through Reduced Form 


We have argued that the error terms $u_1$ and $u_2$ are correlated with $p$ in equations
(9.2), and hence if we estimate the equations by ordinary least squares, the
parameter estimates are inconsistent. Roughly speaking, the concept of
identification is related to consistent estimation of the parameters. Thus if we can
somehow obtain consistent estimates of the parameters in the demand function,
we say that the demand function is identified. Similarly, if we can somehow
get consistent estimates of the parameters in the supply function, we say
that the supply function is identified. Getting consistent estimates is just a
necessary condition for identification, not a sufficient condition, as we show in the
next section.

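If we solve the two equations in (9.2) for $q$ and $p$ in terms of $y$ and $R$, we get

$q = \frac{a_1 b_2 - a_2 b_1}{b_2 - b_1} + \frac{c_1 b_2}{b_2 - b_1}\, y - \frac{c_2 b_1}{b_2 - b_1}\, R + \text{an error}$

$p = \frac{a_1 - a_2}{b_2 - b_1} + \frac{c_1}{b_2 - b_1}\, y - \frac{c_2}{b_2 - b_1}\, R + \text{an error}$          (9.3)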


These equations are called the reduced-form equations. Equations (9.2) are
called the structural equations because they describe the structure of the
economic system. We can write equations (9.3) as

$q = \pi_1 + \pi_2 y + \pi_3 R + v_1$
$p = \pi_4 + \pi_5 y + \pi_6 R + v_2$          (9.4)

where $v_1$ and $v_2$ are error terms and

$\pi_1 = \frac{a_1 b_2 - a_2 b_1}{b_2 - b_1}, \qquad \pi_2 = \frac{c_1 b_2}{b_2 - b_1}, \qquad$ etc.


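The $\pi$'s are called reduced-form parameters. The estimation of equations
(9.4) by ordinary least squares gives us consistent estimates of the reduced-form
parameters. From these we have to obtain consistent estimates of the
parameters in equations (9.2). These parameters are called structural parameters.
Comparing (9.3) with (9.4) we get

$\hat{b}_1 = \hat{\pi}_3 / \hat{\pi}_6 \qquad\qquad \hat{b}_2 = \hat{\pi}_2 / \hat{\pi}_5$
$\hat{c}_2 = \hat{\pi}_6 (\hat{b}_1 - \hat{b}_2) \qquad \hat{c}_1 = -\hat{\pi}_5 (\hat{b}_1 - \hat{b}_2)$
$\hat{a}_1 = \hat{\pi}_1 - \hat{b}_1 \hat{\pi}_4 \qquad\quad \hat{a}_2 = \hat{\pi}_1 - \hat{b}_2 \hat{\pi}_4$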


Since $\hat{a}_1$, $\hat{a}_2$, $\hat{b}_1$, $\hat{b}_2$, $\hat{c}_1$, $\hat{c}_2$ are all single-valued functions of the $\hat{\pi}$'s, they are
consistent estimates of the corresponding structural parameters. As mentioned
earlier, this method is known as the indirect least squares method.
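The indirect least squares steps for model (9.2) can be sketched on simulated data. This is a minimal illustration under assumptions that are not in the text: the parameter values, the normal distributions for $y$, $R$, and the errors, and the helper names `solve`, `ols`, and `indirect_least_squares` are all hypothetical. The reduced form is fitted by OLS and the structural parameters are recovered as $b_1 = \pi_3/\pi_6$, $b_2 = \pi_2/\pi_5$, $c_1 = -\pi_5(b_1 - b_2)$, $c_2 = \pi_6(b_1 - b_2)$, $a_1 = \pi_1 - b_1\pi_4$, $a_2 = \pi_1 - b_2\pi_4$.

```python
import random


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def indirect_least_squares(q, p, y, R):
    """Fit the reduced form (9.4) by OLS, then solve back for the structure (9.2)."""
    X = [[1.0, yi, Ri] for yi, Ri in zip(y, R)]
    pi1, pi2, pi3 = ols(X, q)
    pi4, pi5, pi6 = ols(X, p)
    b1 = pi3 / pi6
    b2 = pi2 / pi5
    return {
        "a1": pi1 - b1 * pi4, "b1": b1, "c1": -pi5 * (b1 - b2),
        "a2": pi1 - b2 * pi4, "b2": b2, "c2": pi6 * (b1 - b2),
    }


# Assumed "true" structural parameters, chosen only for the simulation.
true_params = {"a1": 5.0, "b1": -1.0, "c1": 0.5, "a2": 1.0, "b2": 1.0, "c2": 0.8}
t = true_params
rng = random.Random(1)
q, p, y, R = [], [], [], []
for _ in range(20000):
    yi, Ri = rng.gauss(0.0, 2.0), rng.gauss(0.0, 2.0)
    u1, u2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Equilibrium price and quantity solve demand = supply.
    pi_ = (t["a1"] - t["a2"] + t["c1"] * yi - t["c2"] * Ri + u1 - u2) / (t["b2"] - t["b1"])
    qi = t["a1"] + t["b1"] * pi_ + t["c1"] * yi + u1
    q.append(qi); p.append(pi_); y.append(yi); R.append(Ri)

estimates = indirect_least_squares(q, p, y, R)
for name in true_params:
    print(f"{name}: true {true_params[name]:+.2f}  ILS {estimates[name]:+.3f}")
```

With a large simulated sample the recovered values should be close to the assumed ones, which is what consistency of the indirect least squares estimates means in this exactly identified case.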

It may not always be possible to get estimates of the structural coefficients
from the estimates of the reduced-form coefficients, and sometimes we get
multiple estimates and we have the problem of choosing between them. For
example, suppose that the demand and supply model is written as

$q = a_1 + b_1 p + c_1 y + u_1$    demand function
$q = a_2 + b_2 p + u_2$            supply function

Then the reduced form is

$q = \frac{a_1 b_2 - a_2 b_1}{b_2 - b_1} + \frac{c_1 b_2}{b_2 - b_1}\, y + v_1$

$p = \frac{a_1 - a_2}{b_2 - b_1} + \frac{c_1}{b_2 - b_1}\, y + v_2$          (9.5)

or

$q = \pi_1 + \pi_2 y + v_1$
$p = \pi_3 + \pi_4 y + v_2$

In this case $\hat{b}_2 = \hat{\pi}_2/\hat{\pi}_4$ and $\hat{a}_2 = \hat{\pi}_1 - \hat{b}_2 \hat{\pi}_3$. But there is no way of getting
estimates of $a_1$, $b_1$, and $c_1$. Thus the supply function is identified but the demand
function is not. On the other hand, suppose that we have the model

$q = a_1 + b_1 p + u_1$            demand function
$q = a_2 + b_2 p + c_2 R + u_2$    supply function

Now we can check that the demand function is identified but the supply
function is not.

Finally, suppose that we have the system 

$q = a_1 + b_1 p + c_1 y + d_1 R + u_1$    demand function
$q = a_2 + b_2 p + u_2$                    supply function          (9.6)

Rainfall affects demand (if there is rain, people do not go shopping), but not 
supply. The reduced-form equations are 


$q = \frac{a_1 b_2 - a_2 b_1}{b_2 - b_1} + \frac{c_1 b_2}{b_2 - b_1}\, y + \frac{d_1 b_2}{b_2 - b_1}\, R + v_1$

$p = \frac{a_1 - a_2}{b_2 - b_1} + \frac{c_1}{b_2 - b_1}\, y + \frac{d_1}{b_2 - b_1}\, R + v_2$

or

$q = \pi_1 + \pi_2 y + \pi_3 R + v_1$
$p = \pi_4 + \pi_5 y + \pi_6 R + v_2$

Now we get two estimates of $b_2$. One is $\hat{b}_2 = \hat{\pi}_2/\hat{\pi}_5$ and the other is $\hat{b}_2 =
\hat{\pi}_3/\hat{\pi}_6$, and these need not be equal. For each of these we get an estimate of $a_2$,
which is $\hat{a}_2 = \hat{\pi}_1 - \hat{b}_2 \hat{\pi}_4$.

On the other hand, we get no estimates for the parameters $a_1$, $b_1$, $c_1$, and $d_1$
of the demand function. Here we say that the supply function is over-identified
and the demand function is under-identified. When we get unique estimates for
the structural parameters of an equation from the reduced-form parameters, we
say that the equation is exactly identified. When we get multiple estimates, we
say that the equation is over-identified, and when we get no estimates, we say
that the equation is under-identified (or not identified).

There is a simple counting rule available in the linear systems that we have 
been considering. This counting rule is also known as the order condition for 
identification. This rule is as follows: Let $g$ be the number of endogenous
variables in the system and $k$ the total number of variables (endogenous and
exogenous) missing from the equation under consideration. Then:

1. If $k = g - 1$, the equation is exactly identified.

2. If $k > g - 1$, the equation is over-identified.

3. If $k < g - 1$, the equation is under-identified.

This condition is only necessary but not sufficient. In Section 9.4 we will 
give a necessary and sufficient condition. 

Let us apply this rule to the equation systems we are considering. In
equations (9.2), $g$, the number of endogenous variables, is 2, and there is only one
variable missing from each equation. Hence both equations are exactly
identified. In equations (9.5), again $g = 2$. There is no variable missing from the
first equation; hence it is under-identified. There is one variable missing in the
second equation; hence it is exactly identified. In equations (9.6), there is no
variable missing in the first equation; hence it is not identified. In the second
equation there are two variables missing; thus $k > g - 1$ and the equation is
over-identified.
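The counting rule is mechanical enough to write down directly. The sketch below is an illustration only; the function names and the way each system is encoded as lists of variable names are assumptions made here, and the labels follow the way the text refers to the three demand and supply systems.

```python
def order_condition(g, k):
    """Classify an equation by the order condition.

    g: number of endogenous variables in the system
    k: number of variables (endogenous and exogenous) missing from the equation
    """
    if k == g - 1:
        return "exactly identified"
    if k > g - 1:
        return "over-identified"
    return "under-identified"


def classify_system(all_vars, equations, g):
    """Apply the order condition to every equation of a system."""
    results = []
    for eq_vars in equations:
        k = len(set(all_vars) - set(eq_vars))  # variables missing from this equation
        results.append(order_condition(g, k))
    return results


# Each entry: (all variables in the system, [variables in demand, variables in supply]).
systems = {
    "(9.2)": (["q", "p", "y", "R"], [["q", "p", "y"], ["q", "p", "R"]]),
    "(9.5)": (["q", "p", "y"], [["q", "p", "y"], ["q", "p"]]),
    "(9.6)": (["q", "p", "y", "R"], [["q", "p", "y", "R"], ["q", "p"]]),
}
for label, (all_vars, equations) in systems.items():
    print(label, classify_system(all_vars, equations, g=2))
```

Because the order condition is only necessary, a result of "exactly identified" or "over-identified" from this count still has to be confirmed by the rank condition.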

Illustrative Example 

The indirect least squares method we have described is rarely used. In the
following sections we describe a more popular method of estimating simultaneous
equation models. This is the method of two-stage least squares (2SLS).
However, if some coefficients in the reduced-form equations are close to zero, 
this gives us some information about what variables to omit in the structural 
equations. We will provide a simple example of a two-equation demand and 
supply model where the estimates from OLS, reduced-form least squares, and 
indirect least squares provide information on how to formulate the model. The 
example also illustrates some points we have raised in Section 9.1 regarding 
normalization. 

The model is from Merrill and Fox. 1 In Table 9.1 data are presented for demand
and supply of pork in the United States for 1922-1941. The model estimated
by Merrill and Fox is

$Q = a_1 + b_1 P + c_1 Y + u$    demand function
$Q = a_2 + b_2 P + c_2 Z + v$    supply function

The equations were fitted for 1922-1941. The reduced-form equations were
(standard errors in parentheses)

P = -0.0101 + 1.0813Y - 0.8320Z      R^2 = 0.893
             (0.1339)   (0.1159)

Q = 0.0026 - 0.0018Y + 0.6839Z       R^2 = 0.898
            (0.0673)   (0.0582)

The coefficient of $Y$ in the second equation is very close to zero and the variable
$Y$ can be dropped from this equation. This would imply that $b_2 = 0$, or supply
is not responsive to price. In any case, solving from the reduced form to the
structural form, we get the estimates of the structural equations as

Q = -0.0063 - 0.8220P + 0.8870Y    demand function
Q = 0.0026 - 0.0017P + 0.6825Z     supply function
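The step from the reduced form to the structural form is simple arithmetic and can be checked from the reported coefficients. The sketch below assumes the same mapping as in the rainfall model, with $Z$ in the role of the exogenous supply shifter: $b_1 = \pi_3/\pi_6$, $b_2 = \pi_2/\pi_5$, $c_1 = -\pi_5(b_1 - b_2)$, $c_2 = \pi_6(b_1 - b_2)$, $a_1 = \pi_1 - b_1\pi_4$, $a_2 = \pi_1 - b_2\pi_4$. The inputs are the rounded reduced-form estimates quoted above, so small differences in the last decimal are to be expected.

```python
# Reduced-form estimates reported in the text for the pork example.
pi1, pi2, pi3 = 0.0026, -0.0018, 0.6839    # Q on constant, Y, Z
pi4, pi5, pi6 = -0.0101, 1.0813, -0.8320   # P on constant, Y, Z

# Indirect least squares: solve back for the structural coefficients.
b1 = pi3 / pi6                 # price coefficient in the demand function
b2 = pi2 / pi5                 # price coefficient in the supply function
c1 = -pi5 * (b1 - b2)          # income coefficient in the demand function
c2 = pi6 * (b1 - b2)           # coefficient of Z in the supply function
a1 = pi1 - b1 * pi4            # demand intercept
a2 = pi1 - b2 * pi4            # supply intercept

print(f"demand: Q = {a1:.4f} + ({b1:.4f})P + {c1:.4f}Y")
print(f"supply: Q = {a2:.4f} + ({b2:.4f})P + {c2:.4f}Z")

# The demand function renormalized with respect to P.
print(f"demand: P = {a1 / -b1:.4f} + ({1 / b1:.4f})Q + {c1 / -b1:.4f}Y")
```

The slope coefficients computed this way agree with the structural estimates quoted in the text to four decimals. The demand intercept computed from the rounded inputs differs from the quoted value in the fourth decimal, which is presumably a rounding effect.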

The least squares estimates of the demand function are:

Normalized with respect to Q:

Q = -0.0049 - 0.7205P + 0.7646Y      R^2 = 0.903
             (0.0594)   (0.0967)

Normalized with respect to P:

P = -0.0070 - 1.2518Q + 1.0754Y      R^2 = 0.956
             (0.1032)   (0.0861)

1 William C. Merrill and Karl A. Fox, Introduction to Economic Statistics (New York: Wiley,
1971).

Table 9.1 Demand and Supply of Pork, United States, 1922-1941 (a)

Year     P_t     Q_t     Y_t     Z_t
1922     26.8    65.7    541     74.0
1923     25.3    74.2    616     84.7
1924     25.3    74.0    610     80.2
1925     31.1    66.8    636     69.9
1926     33.3    64.1    651     66.8
1927     31.2    67.7    645     71.6
1928     29.5    70.9    653     73.6
1929     30.3    69.6    682     71.2
1930     29.1    67.0    604     69.6
1931     23.7    68.4    515     68.0
1932     15.6    70.7    390     74.8
1933     13.9    69.6    364     73.6
1934     18.8    63.1    411     70.2
1935     27.4    48.4    459     46.5
1936     26.9    55.1    517     57.6
1937     27.7    55.8    551     58.7
1938     24.5    58.2    506     58.0
1939     22.2    64.7    538     67.2
1940     19.3    73.5    576     73.7
1941     24.7    68.4    697     66.5

(a) P_t, retail price of pork (cents per pound); Q_t,
consumption of pork (pounds per capita); Y_t, disposable
personal income (dollars per capita); Z_t, “predetermined
elements in pork production.”

Source: William C. Merrill and Karl A. Fox, Introduction
to Economic Statistics (New York: Wiley, 1971), p. 539.


The structural demand function can also be written in the two forms:

Normalized with respect to Q:

Q = -0.0063 - 0.8220P + 0.8870Y

Normalized with respect to P:

P = -0.0077 - 1.2165Q + 1.0791Y

The estimates of the parameters in the demand function are almost the same
with the direct least squares method as with the indirect least squares method
when the demand function is normalized with respect to P.

Which is the correct normalization? We argued in Section 9.1 that if quantity
supplied is not responsive to price, the demand function should be normalized
with respect to P. We saw that the fact that the coefficient of Y in the
reduced-form equation for Q was close to zero implied that $b_2 = 0$, or quantity supplied
is not responsive to price. This is also confirmed by the structural estimate of $b_2$,
which has the wrong sign but is close to zero.

Dropping P from the supply function and using OLS, we get the supply
function as

Q = 0.0025 + 0.6841Z      R^2 = 0.898
            (0.0857)

Thus we normalize the demand function with respect to P, drop the variable P
from the supply function, and estimate both equations by OLS.

9.4 Necessary and Sufficient Conditions 
for Identification 


In Section 9.3 our discussion was in terms of obtaining estimates for the
structural parameters from estimates of the reduced-form parameters. An
alternative way of looking at the identification problem is to see whether the equation
under consideration can be obtained as a linear combination of the other
equations.

Consider, for instance, equations (9.5). Take a weighted average of the two 
equations with weights w and (1 - w). Then we get 

$q = w(a_1 + b_1 p + c_1 y) + (1 - w)(a_2 + b_2 p) + u_1^*$
$\;\; = a_1^* + b_1^* p + c_1^* y + u_1^*$          (9.7)

where $a_1^* = w a_1 + (1 - w) a_2$
      $b_1^* = w b_1 + (1 - w) b_2$
      $c_1^* = w c_1$ and $u_1^* = w u_1 + (1 - w) u_2$

Equation (9.7) looks like the first equation in (9.5). Thus when we estimate the
parameters of the demand function, we do not know whether we are getting
estimates of the parameters in the demand function or in some weighted
average of the demand and supply functions. Thus the parameters in the demand
function are not identified. The same cannot be said about the supply function
because the only way that equation (9.7) can look like the supply function in
(9.5) is if $c_1^* = 0$ (i.e., $w c_1 = 0$). But since $c_1 \neq 0$, we must have $w = 0$. That
is, the weighted average gives zero weight to the demand function. Hence when
we estimate the supply function, we are sure that the estimates we have are of
the supply function. Thus the parameters of the supply function are identified.

Checking this condition has been easy with two equations. But when we have 
many equations, we need a more systematic way of checking this condition. 
For illustrative purposes consider the case of three endogenous variables $y_1$,
$y_2$, $y_3$ and three exogenous variables $z_1$, $z_2$, $z_3$. We will mark with a cross x if a
variable occurs in an equation and a 0 if not. Suppose that the equation system
is the following:

Equation    y1    y2    y3    z1    z2    z3
1           x     0     x     x     0     x
2           x     0     0     x     0     x
3           0     x     x     x     x     0


The rule for identification of any equation is as follows:

1. Delete the particular row.

2. Pick up the columns corresponding to the elements that have zeros in
that row.

3. If any column (or row) in this array is proportional to another column
(or row) for all parameter values, delete it. If from the remaining array
we can find (g - 1) rows and (g - 1) columns that are not all zeros, where g
is the number of endogenous variables, then the equation is identified.
Otherwise, it is not.

This condition is called the rank condition for identification and is a
necessary and sufficient condition. 2

We will now apply the order and rank conditions to the illustrative example
we have. The number of endogenous variables is g = 3, so g - 1 = 2. The number
of missing variables is 2 for the first equation, 3 for the second, and 2 for the
third. Thus, by the order condition, the first and third equations are exactly
identified and the second equation is over-identified. We will see that the rank
condition gives us a different answer.

To check the rank condition for equation 1, we delete the first row and pick
up the columns corresponding to the missing variables $y_2$ and $z_2$. The columns
are

0  0
x  x

Note that we have only one row with not all elements zero. Thus, by the rank
condition the equation is not identified. For equation 2, we delete row 2, and
pick up the columns corresponding to $y_2$, $y_3$, and $z_2$. We get

0  x  0
x  x  x

We now have two rows (and two columns) with not all elements zero. Thus the
equation is identified. Similarly, for the third equation deleting the third row,
the columns for $y_1$ and $z_3$ give

x  x
x  x


Again we have two rows with not all elements zero. Hence the equation is
identified. Note that the rank condition states whether the equation is
identified or not. From the order condition we know whether it is exactly identified
or over-identified.

2 The array of columns is called a matrix and the condition that we have stated is that the rank
of this matrix be (g - 1): hence the use of the term rank condition. We have avoided the use
of matrix notation and stated the condition in an alternative fashion. A derivation using matrix
notation is presented in the appendix to this chapter.

In summary the second and third equations are estimable. The first one is 
not, and the order condition misleads us into thinking that it is so. There are 
many estimation methods for simultaneous equations models that break down 
if the order condition is not satisfied but do give us estimates of the parameters 
if the order condition is satisfied even if the rank condition is not. In such cases 
these estimates are meaningless. Thus it is desirable to check the rank
condition. In our example, for equation 1, the rank condition is not satisfied. What
this means is that the estimates we obtain for the parameters in equation 1 are 
actually estimates of some linear combinations of the parameters in all the 
equations and thus have no special economic interpretation. This is what we 
mean when we say that equation 1 is not estimable. 
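The rank check for the three-equation example can be automated. The sketch below does not implement the row-and-column counting rule literally. Instead it uses a swapped-in technique: each x in the table is replaced by a randomly drawn nonzero coefficient and the numerical rank of the resulting array is computed, which gives the rank for "generic" parameter values. The function names, the random draws, and the tolerance are assumptions made for this illustration.

```python
import random


def numeric_rank(rows, tol=1e-9):
    """Rank of a small matrix by Gaussian elimination with partial pivoting."""
    M = [r[:] for r in rows]
    n_rows = len(M)
    n_cols = len(M[0]) if M else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        piv = max(range(rank, n_rows), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < tol:
            continue  # no usable pivot in this column
        M[rank], M[piv] = M[piv], M[rank]
        for r in range(rank + 1, n_rows):
            f = M[r][col] / M[rank][col]
            for c in range(col, n_cols):
                M[r][c] -= f * M[rank][c]
        rank += 1
    return rank


def rank_condition(pattern, g, seed=0):
    """For each equation, report whether the rank condition holds generically.

    pattern[i][j] is 1 if variable j appears in equation i and 0 otherwise.
    """
    rng = random.Random(seed)
    # Replace every x by an arbitrary nonzero coefficient.
    coef = [[rng.uniform(0.5, 1.5) if cell else 0.0 for cell in row] for row in pattern]
    results = []
    for i, row in enumerate(pattern):
        missing = [j for j, cell in enumerate(row) if not cell]
        # Delete row i and keep only the columns of the variables missing from it.
        sub = [[coef[r][j] for j in missing] for r in range(len(pattern)) if r != i]
        results.append(numeric_rank(sub) == g - 1)
    return results


# Columns: y1, y2, y3, z1, z2, z3 (1 = variable present, 0 = absent).
pattern = [
    [1, 0, 1, 1, 0, 1],  # equation 1
    [1, 0, 0, 1, 0, 1],  # equation 2
    [0, 1, 1, 1, 1, 0],  # equation 3
]
print(rank_condition(pattern, g=3))
```

For this table the check reports that equation 1 fails the rank condition while equations 2 and 3 satisfy it, in line with the discussion above. A random substitution can only indicate the rank for typical parameter values; restrictions that make coefficients proportional for all parameter values would need to be imposed separately.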

Illustrative Example 

As an illustration, consider the following macroeconomic model with seven
endogenous variables and three exogenous variables. The endogenous and
exogenous variables are:

Endogenous                      Exogenous
C = real consumption            G = real government purchases
I = real investment             T = real tax receipts
N = employment                  M = nominal money stock
P = price level
R = interest rate
Y = real income
W = money wage rate

The equations are:

(1) $C = a_1 + b_1 Y - c_1 T + d_1 R + u_1$    (consumption function)
(2) $I = a_2 + b_2 Y + c_2 R + u_2$            (investment function)
(3) $Y = C + I + G$                            (identity)
(4) $M = a_3 + b_3 Y + c_3 R + d_3 P + u_3$    (liquidity preference function)
(5) $Y = a_4 + b_4 N + u_4$                    (production function)
(6) $N = a_5 + b_5 W + c_5 P + u_5$            (labor demand)
(7) $N = a_6 + b_6 W + c_6 P + u_6$            (labor supply)

The question is which of these equations is under-identified, exactly identified,
and over-identified. To answer this question, we prepare the following table. A
1 denotes that the variable is present, a 0 denotes that it is missing.


Equation   C  I  N  P  R  Y  W  G  T  M
1          1  0  0  0  1  1  0  0  1  0
2          0  1  0  0  1  1  0  0  0  0
3          1  1  0  0  0  1  0  1  0  0
4          0  0  0  1  1  1  0  0  0  1
5          0  0  1  0  0  1  0  0  0  0
6          0  0  1  1  0  0  1  0  0  0
7          0  0  1  1  0  0  1  0  0  0


Note that equation 3 is an identity and does not have any parameters to be 
estimated. Hence we do not need to discuss its identification. The number of 
endogenous variables minus one is 6 in this model. By the order condition we 
get the result that equations 1 and 4 are exactly identified and equations 2, 5, 
6, and 7 are over-identified. 

Let us look at the rank condition for equation 1. The procedure is similar for 
other equations. Delete the first row and gather the columns for the missing 
variables I, N, P, W, G, M. We get 

1 0 0 0 0 0
1 0 0 0 1 0
0 0 1 0 0 1
0 1 0 0 0 0
0 1 1 1 0 0
0 1 1 1 0 0

Since we have six rows (and six columns) whose elements are not all zero, the 
equation is identified. It can be checked that the same is true for equations 2, 
4, and 5. However, for equations 6 and 7, we cannot find six rows whose elements
are not all zero. Thus, equations 6 and 7 are not identified by the rank
condition even though they are over-identified by the order condition. 
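
The counting in this example can be mechanized. The following is a minimal sketch, not a general identification test: it encodes the 0/1 table above and applies the order condition together with the "six rows and six columns not all zero" shortcut used in the text. The names `TABLE`, `order_condition`, and `rank_shortcut` are hypothetical, and a full rank check would need the actual coefficient values rather than the 0/1 pattern.

```python
# 1 = variable appears in the equation, 0 = variable is missing (table from the text).
VARS = ["C", "I", "N", "P", "R", "Y", "W", "G", "T", "M"]
TABLE = {
    1: [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
    2: [0, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    3: [1, 1, 0, 0, 0, 1, 0, 1, 0, 0],  # identity, nothing to estimate
    4: [0, 0, 0, 1, 1, 1, 0, 0, 0, 1],
    5: [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    6: [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    7: [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
}
N_ENDOG = 7  # C, I, N, P, R, Y, W


def order_condition(eq):
    """Compare the number of missing variables with (endogenous variables - 1)."""
    missing = TABLE[eq].count(0)
    need = N_ENDOG - 1
    if missing < need:
        return "under-identified"
    return "exactly identified" if missing == need else "over-identified"


def rank_shortcut(eq):
    """Delete row `eq`, keep the columns missing from it, count nonzero rows and columns."""
    cols = [j for j, present in enumerate(TABLE[eq]) if present == 0]
    others = [row for other, row in TABLE.items() if other != eq]
    rows_ok = sum(1 for row in others if any(row[j] for j in cols))
    cols_ok = sum(1 for j in cols if any(row[j] for row in others))
    return rows_ok, cols_ok, min(rows_ok, cols_ok) >= N_ENDOG - 1


for eq in (1, 2, 4, 5, 6, 7):
    print(eq, order_condition(eq), rank_shortcut(eq))
```

Under these assumptions the sketch reproduces the conclusions of the text: equations 1, 2, 4, and 5 pass the shortcut, while equations 6 and 7 have only five nonzero rows.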


9.5 Methods of Estimation: The 
Instrumental Variable Method 


In previous sections we discussed the identification problem. Now we discuss 
some methods of estimation for simultaneous equations models. Actually, we 
have already discussed one method of estimation: the indirect least squares 
method. However, this method is very cumbersome if there are many equations 
and hence it is not often used. Here we discuss some methods that are more 
generally applicable. These methods of estimation can be classified into two
categories: 

1. Single-equation methods (also called “limited-information methods”). 

2. System methods (also called “full-information methods”). 

We will not discuss the full-information methods here because they involve 
algebraic detail beyond the scope of this book. We discuss single-equation 
methods only. In these methods we estimate each equation separately using 
only the information about the restrictions on the coefficients of that particular 
equation. For instance, in the illustrative example in Section 9.4 when we estimate
the consumption function (the first equation), we just make use of the
fact that the variables I, N, P, W, G, M are missing and that the last two variables
are exogenous. We are not concerned about what variables are missing
from the other equations. The restrictions on the other equations are used only 
to check identification, not for estimation. This is the reason these methods are 
called limited-information methods. In full-information methods we use information
on the restrictions on all equations.

A general method of obtaining consistent estimates of the parameters in
simultaneous-equations models is the instrumental variable method. Broadly
speaking, an instrumental variable is a variable that is uncorrelated with the 
error term but correlated with the explanatory variables in the equation. 

For instance, suppose that we have the equation 

$$y = \beta x + u$$

where x is correlated with u. Then we cannot estimate this equation by ordinary
least squares. The estimate of $\beta$ is inconsistent because of the correlation
between x and u. If we can find a variable z that is uncorrelated with u, we can
get a consistent estimator for $\beta$. We replace the condition cov(z, u) = 0 by its
sample counterpart

$$\frac{1}{n} \sum z(y - \hat{\beta} x) = 0$$

This gives

$$\hat{\beta} = \frac{\sum zy}{\sum zx} = \frac{\sum z(\beta x + u)}{\sum zx} = \beta + \frac{\sum zu}{\sum zx}$$

But $\sum zu / \sum zx$ can be written as

$$\frac{(1/n) \sum zu}{(1/n) \sum zx}$$

The probability limit of this expression is

$$\frac{\operatorname{cov}(z, u)}{\operatorname{cov}(z, x)} = 0$$

since cov(z, x) $\neq$ 0. Hence plim $\hat{\beta} = \beta$, thus proving that $\hat{\beta}$ is a consistent
estimator for $\beta$. Note that we require z to be correlated with x so that
cov(z, x) $\neq$ 0.
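
The consistency argument can be illustrated by simulation. The sketch below is only an illustration under assumed values: the true slope of 3, the error structure, and the function name `simulate` are all hypothetical choices, not taken from the text. It compares the OLS ratio $\sum xy / \sum x^2$ with the IV ratio $\sum zy / \sum zx$ when x is correlated with u.

```python
import random


def simulate(n=100_000, beta=3.0, seed=0):
    """Return (OLS estimate, IV estimate) for y = beta*x + u with cov(x, u) != 0."""
    rng = random.Random(seed)
    szy = szx = sxy = sxx = 0.0
    for _ in range(n):
        z = rng.gauss(0, 1)          # instrument: independent of u
        v = rng.gauss(0, 1)          # shock shared by x and u (source of endogeneity)
        e = rng.gauss(0, 1)
        x = z + v                    # cov(z, x) = 1, so z is a relevant instrument
        u = 0.8 * v + e              # cov(x, u) = 0.8, so OLS is inconsistent
        y = beta * x + u
        szy += z * y
        szx += z * x
        sxy += x * y
        sxx += x * x
    return sxy / sxx, szy / szx


beta_ols, beta_iv = simulate()
print(f"OLS: {beta_ols:.3f}   IV: {beta_iv:.3f}")
```

With these assumed values the OLS estimate settles near 3 + 0.8/2 = 3.4, while the IV estimate settles near the true value 3.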
Now consider the simultaneous-equations model

$$
\begin{aligned}
y_1 &= b_1 y_2 + c_1 z_1 + c_2 z_2 + u_1 \\
y_2 &= b_2 y_1 + c_3 z_3 + u_2
\end{aligned}
\tag{9.8}
$$

where $y_1$, $y_2$ are endogenous variables, $z_1$, $z_2$, $z_3$ are exogenous variables, and
$u_1$, $u_2$ are error terms. Consider the estimation of the first equation. Since $z_1$
and $z_2$ are independent of $u_1$, we have cov($z_1$, $u_1$) = 0 and cov($z_2$, $u_1$) = 0.
However, $y_2$ is not independent of $u_1$. Hence cov($y_2$, $u_1$) $\neq$ 0. Since we have
three coefficients to estimate, we have to find a variable that is independent of
$u_1$. Fortunately, in this case we have $z_3$ and cov($z_3$, $u_1$) = 0. $z_3$ is the instrumental
variable for $y_2$. Thus, writing the sample counterparts of these three
covariances, we have the three equations

$$
\begin{aligned}
\frac{1}{n} \sum z_1 (y_1 - b_1 y_2 - c_1 z_1 - c_2 z_2) &= 0 \\
\frac{1}{n} \sum z_2 (y_1 - b_1 y_2 - c_1 z_1 - c_2 z_2) &= 0 \\
\frac{1}{n} \sum z_3 (y_1 - b_1 y_2 - c_1 z_1 - c_2 z_2) &= 0
\end{aligned}
\tag{9.8'}
$$

The difference between the normal equations for the ordinary least squares
method and the instrumental variable method is only in the last equation.

Consider the second equation of our model. Now we have to find an instrumental
variable for $y_1$ but we have a choice of $z_1$ and $z_2$. This is because this
equation is over-identified (by the order condition).

Note that the order condition (counting rule) is related to the question of
whether or not we have enough exogenous variables elsewhere in the system
to use as instruments for the endogenous variables in the equation with unknown
coefficients. If the equation is under-identified we do not have enough
instrumental variables. If it is exactly identified, we have just enough instrumental
variables. If it is over-identified, we have more than enough instrumental
variables. In this case we have to use weighted averages of the instrumental
variables available. We compute these weighted averages so that we get the
most efficient (minimum asymptotic variance) estimators.

It has been shown (proving this is beyond the scope of this book) that the
efficient instrumental variables are constructed by regressing the endogenous
variables on all the exogenous variables in the system (i.e., estimating the
reduced-form equations). In the case of the model given by equations (9.8), we
first estimate the reduced-form equations by regressing $y_1$ and $y_2$ on $z_1$, $z_2$, $z_3$.
We obtain the predicted values $\hat{y}_1$ and $\hat{y}_2$ and use these as instrumental variables.
For the estimation of the first equation we use $\hat{y}_2$, and for the estimation
of the second equation we use $\hat{y}_1$. We can write $\hat{y}_1$ and $\hat{y}_2$ as linear functions of
$z_1$, $z_2$, $z_3$. Let us write

$$
\begin{aligned}
\hat{y}_1 &= a_{11} z_1 + a_{12} z_2 + a_{13} z_3 \\
\hat{y}_2 &= a_{21} z_1 + a_{22} z_2 + a_{23} z_3
\end{aligned}
$$

where the a's are obtained from the estimation of the reduced-form equations
by OLS. In the estimation of the first equation in (9.8) we use $\hat{y}_2$, $z_1$, and $z_2$ as
instruments. This is the same as using $z_1$, $z_2$, and $z_3$ as instruments because

$$
\begin{aligned}
\sum \hat{y}_2 u_1 = 0 &\Rightarrow \sum (a_{21} z_1 + a_{22} z_2 + a_{23} z_3) u_1 = 0 \\
&\Rightarrow a_{21} \sum z_1 u_1 + a_{22} \sum z_2 u_1 + a_{23} \sum z_3 u_1 = 0
\end{aligned}
$$

But the first two terms are zero by virtue of the first two equations in (9.8').
Thus $\sum \hat{y}_2 u_1 = 0 \Rightarrow \sum z_3 u_1 = 0$. Hence using $\hat{y}_2$ as an instrumental variable
is the same as using $z_3$ as an instrumental variable. This is the case with exactly
identified equations where there is no choice in the instruments.

The case with the second equation in (9.8) is different. Earlier, we said that
we had a choice between $z_1$ and $z_2$ as instruments for $y_1$. The use of $\hat{y}_1$ gives the
optimum weighting. The normal equations now are

$$\sum \hat{y}_1 u_2 = 0 \quad \text{and} \quad \sum z_3 u_2 = 0$$

$$
\begin{aligned}
\sum \hat{y}_1 u_2 = 0 &\Rightarrow \sum (a_{11} z_1 + a_{12} z_2 + a_{13} z_3) u_2 = 0 \\
&\Rightarrow \sum (a_{11} z_1 + a_{12} z_2) u_2 = 0
\end{aligned}
$$

since $\sum z_3 u_2 = 0$. Thus the optimal weights for $z_1$ and $z_2$ are $a_{11}$ and $a_{12}$.


Measuring $R^2$

When we are estimating an equation with instrumental variables, a question
arises as to how to report the goodness-of-fit measure $R^2$. This question also
arises with the two-stage least squares (2SLS) method described in Section 9.6.
We can think of two measures.

1. $R^2$ = squared correlation between $y$ and $\hat{y}$.

2. Measures based on residual sum of squares:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

In the results we present we will be reporting the second measure. 

Note that the $R^2$ from the instrumental variable method will be lower than
the $R^2$ from the OLS method. It is also conceivable that the $R^2$ would be negative,
but this is an indication that something is wrong with the specification;
perhaps the equation is not identified. We illustrate this point with an example.
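
The two measures can behave very differently. The sketch below computes both for a given set of fitted values; the function name `r2_measures` and the five data points are hypothetical, chosen only to show that the first measure always lies between 0 and 1 while the second can be negative when the fitted values are far from the observations.

```python
def r2_measures(y, y_fit):
    """Return (squared correlation of y and y_fit, 1 - RSS/TSS)."""
    n = len(y)
    y_bar = sum(y) / n
    f_bar = sum(y_fit) / n
    s_yf = sum((a - y_bar) * (b - f_bar) for a, b in zip(y, y_fit))
    s_yy = sum((a - y_bar) ** 2 for a in y)          # total sum of squares
    s_ff = sum((b - f_bar) ** 2 for b in y_fit)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_fit))  # residual sum of squares
    return s_yf ** 2 / (s_yy * s_ff), 1 - rss / s_yy


# Hypothetical fitted values: perfectly correlated with y but badly scaled,
# as can happen when the fitted values are not least squares predictions.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [-10.0, -5.0, 0.0, 5.0, 10.0]
r2_corr, r2_rss = r2_measures(y, y_fit)
print(r2_corr, r2_rss)
```

Because IV fitted values are not chosen to minimize the residual sum of squares, nothing forces the second measure to be nonnegative.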

Many computer programs present an $R^2$ for simultaneous-equations models.
But since there is no unique measure of $R^2$ in such models, it is important to





Table 9.2 Data for Wine Industry in Australia^a

Year          Q       S       P_w     P_b     A       Y
1955-1956    0.91    85.4     77.5    35.7    89.1   1056
1956-1957    1.05    88.4     80.2    37.4    83.3   1037
1957-1958    1.18    89.1     79.5    37.7    84.4   1006
1958-1959    1.27    90.5     84.9    37.1    90.1   1047
1959-1960    1.27    93.1     94.9    36.2    89.4   1091
1960-1961    1.37    97.2     92.7    35.0    89.3   1093
1961-1962    1.46   100.3     92.5    37.6    89.8   1102
1962-1963    1.59   100.3     92.7    40.1    96.7   1154
1963-1964    1.86   101.5     97.1    39.7    99.9   1234
1964-1965    1.96   104.8     93.9    38.3   103.2   1254
1965-1966    2.32   107.5    102.7    37.0   102.2   1241
1966-1967    2.86   111.8    100.0    36.1   100.0   1299
1967-1968    3.50   114.9    119.5    35.4   103.0   1287
1968-1969    3.96   117.9    119.7    35.1   104.2   1369
1969-1970    4.21   122.3    125.2    34.5   113.0   1443
1970-1971    4.54   128.2    134.1    34.5   132.5   1517
1971-1972    4.93   134.1    124.3    34.3   143.6   1562
1972-1973    5.40   145.1    119.0    34.3   176.2   1678
1973-1974    6.13   174.9    108.5    31.9   159.9   1769
1974-1975    6.29   237.2    107.9    31.0   182.1   1847

^a Data are not in logs.


check the formula that the program uses. For instance, the SAS program gives
an $R^2$, but it appears to be neither of the above-mentioned measures.


Illustrative Example^3


Table 9.2 provides data on some characteristics of the wine industry in Australia
for 1955-1956 to 1974-1975. It is assumed that a reasonable demand-supply
model for the industry would be (where all variables are in logs)

$$Q_t = a_0 + a_1 P_{wt} + a_2 P_{bt} + a_3 Y_t + a_4 A_t + u_t \qquad \text{(demand)}$$

$$Q_t = b_0 + b_1 P_{wt} + b_2 S_t + v_t \qquad \text{(supply)}$$

where $Q_t$ = real per capita consumption of wine
$P_{wt}$ = price of wine relative to CPI
$P_{bt}$ = price of beer relative to CPI
$Y_t$ = real per capita disposable income
$A_t$ = real per capita advertising expenditure
$S_t$ = index of storage costs


^3 I would like to thank Kim Sawyer for providing me with these data and the example.
$Q_t$ and $P_{wt}$ are the two endogenous variables. The other variables are exogenous.
For the estimation of the demand function we have only one instrumental
variable, $S_t$. But for the estimation of the supply function we have available
three instrumental variables: $P_{bt}$, $Y_t$, and $A_t$.

The OLS estimation of the demand function gave the following results (all
variables are in logs and figures in parentheses are t-ratios):

$$Q = \underset{(-6.04)}{-23.651} + \underset{(4.0)}{1.158} P_w - \underset{(-0.45)}{0.275} P_b - \underset{(-1.3)}{0.603} A + \underset{(4.50)}{3.212} Y \qquad R^2 = 0.9772$$

All the coefficients except that of Y have the wrong signs. The coefficient of
$P_w$ not only has the wrong sign but is also significant.

Treating $P_w$ as endogenous and using S as an instrument, we get the following
results:

$$Q = \underset{(-5.09)}{-26.195} + \underset{(0.98)}{0.643} P_w - \underset{(-0.20)}{0.140} P_b - \underset{(-1.51)}{0.985} A + \underset{(3.28)}{4.082} Y \qquad R^2 = 0.9724$$

The coefficient of $P_w$ still has a wrong sign but it is at least not significant. In
any case the conclusion we arrive at is that the quantity demanded is not responsive
to prices and advertising expenditures but is responsive to income.
The income elasticity of demand for wine is about 4.0 (significantly greater than
unity).

Turning next to the supply function, there are three instrumental variables
available for $P_w$: $P_b$, A, and Y. Also, the efficient instrumental variable is obtained
by regressing $P_w$ on $P_b$, A, Y, and S. The results obtained by using the
OLS method and the different instrumental variables are as follows (figures in
parentheses are asymptotic t-ratios for the instrumental variable methods; the
$R^2$'s for the IV methods are computed as explained earlier):


                              Instrumental Variables
Method        OLS         P_b        A          Y          \hat{P}_w
Constant    -15.57      -10.76     -17.65     -16.98     -16.82
            (-18.36)    (-0.28)    (-6.6)     (-14.56)   (-15.57)
P_w           2.145       0.336      2.928      2.676      2.616
             (8.99)      (0.02)     (3.02)     (7.30)     (7.89)
S             1.383       2.131      1.058      1.163      1.188
             (8.95)      (0.36)     (2.47)     (5.72)     (6.24)
R^2           0.9632      0.8390     0.9400     0.9525     0.9548

There are considerable differences between the estimates obtained by the
different instrumental variables methods. In particular, the use of $P_b$ seems to
produce very different results. Y appears to be the best of all the instrumental
variables.

The IV estimates can be obtained by the procedures outlined earlier. But an
easier way is to note that the IV estimator and the two-stage least squares
(2SLS) estimator (which we will describe in Section 9.6) are the same if the
equation is exactly identified. This implies that we can get the IV estimator by
the 2SLS method by changing the model so that the supply function is exactly
identified. In this case, a standard computer program like SAS can be used
to get the estimates as well as their standard errors. In fact, this is the procedure
we have followed.

For instance, to get the IV estimator of the supply function using Y as the
instrumental variable, we can specify the model as

$$Q = \beta_0 + \beta_1 P_w + \beta_2 Y + u \qquad \text{(demand function)}$$

$$Q = \alpha_0 + \alpha_1 P_w + \alpha_2 S + v \qquad \text{(supply function)}$$

Now the supply function is identified exactly.

In this particular case, this does not seem to be an unreasonable model.
Before we leave this example, we will present a case where the $R^2$ from the
IV method is negative. Consider the model

$$Q = \beta_0 + \beta_1 P_w + \beta_2 A + u \qquad \text{(demand function)}$$

$$Q = \alpha_0 + \alpha_1 P_w + \alpha_2 S + v \qquad \text{(supply function)}$$

The first equation is exactly identified. We will use S as the instrumental variable
for $P_w$. (The IV estimator is the same as the 2SLS estimator.) The OLS
estimation of the demand function gave the following results (figures in parentheses
are t-ratios):

$$Q = \underset{(-14.50)}{-15.30} + \underset{(6.60)}{2.064} P_w + \underset{(6.76)}{1.418} A \qquad R^2 = 0.9431$$

The IV estimation gave the following results:

$$Q = \underset{(-0.03)}{-304.13} + \underset{(0.03)}{119.035} P_w - \underset{(-0.03)}{52.177} A \qquad R^2 = -467.3$$

What is the problem? To see what the problem is, let us look at the reduced-form
equations:

$$Q = \underset{(-8.12)}{-10.42} + \underset{(1.75)}{1.316} A + \underset{(1.50)}{1.085} S \qquad R^2 = 0.8208$$

$$P_w = \underset{(4.40)}{2.47} + \underset{(1.37)}{0.449} A + \underset{(0.03)}{0.00912} S \qquad R^2 = 0.4668$$

The coefficient of S in the equation for $P_w$ is almost zero. From the relationships
(9.3) and (9.4) between the parameters in the reduced form and structural
form discussed in Section 9.3, we note that this implies that $\beta_1 \to \infty$ and $\alpha_2$ is
indeterminate. Thus the demand function is not identified. Whenever the $R^2$
from the IV estimator (or the 2SLS estimator) is negative or very low and the
$R^2$ from the OLS estimation is high, it should be concluded that something is
wrong with the specification of the model or the identification of that particular
equation.
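
The explosive IV estimates can be traced back to the reduced form by simple arithmetic. The sketch below assumes the standard indirect least squares relations for an exactly identified equation: substituting the reduced form for $P_w$ into the demand function and matching coefficients gives $\beta_1 = \pi_{QS}/\pi_{PS}$, $\beta_2 = \pi_{QA} - \beta_1 \pi_{PA}$, and $\beta_0 = \pi_{Q0} - \beta_1 \pi_{P0}$. The $\pi$ notation and the dictionary names are hypothetical; the numbers are the rounded reduced-form coefficients reported above.

```python
# Reduced-form coefficients as reported in the text (rounded).
rf_Q = {"const": -10.42, "A": 1.316, "S": 1.085}    # equation for Q
rf_P = {"const": 2.47, "A": 0.449, "S": 0.00912}    # equation for P_w

# S is excluded from the demand function, so its reduced-form coefficients
# pin down the price coefficient as a ratio with a near-zero denominator.
beta1 = rf_Q["S"] / rf_P["S"]
beta2 = rf_Q["A"] - beta1 * rf_P["A"]
beta0 = rf_Q["const"] - beta1 * rf_P["const"]

print(f"beta0 = {beta0:.2f}, beta1 = {beta1:.2f}, beta2 = {beta2:.2f}")
```

Up to rounding of the inputs, these values are close to the reported IV estimates (-304.13, 119.035, -52.177), which shows that the huge coefficients come from dividing by the nearly zero coefficient of S in the $P_w$ equation.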
9.6 Methods of Estimation: The Two-Stage
Least Squares Method 


The two-stage least squares (2SLS) method differs from the instrumental variable
(IV) method described in Section 9.5 in that the $\hat{y}$'s are used as regressors
rather than as instruments, but the two methods give identical estimates.
Consider the equation to be estimated:

$$y_1 = b_1 y_2 + c_1 z_1 + u_1 \tag{9.9}$$

The other exogenous variables in the system are $z_2$, $z_3$, and $z_4$.

Let $\hat{y}_2$ be the predicted value of $y_2$ from a regression of $y_2$ on $z_1$, $z_2$, $z_3$, $z_4$ (the
reduced-form equation). Then

$$y_2 = \hat{y}_2 + v_2$$

where $v_2$, the residual, is uncorrelated with each of the regressors $z_1$, $z_2$, $z_3$, $z_4$
and hence with $\hat{y}_2$ as well. (This is the property of least squares regression that
we discussed in Chapter 4.)

The normal equations for the efficient IV method are

$$
\begin{aligned}
\sum \hat{y}_2 (y_1 - b_1 y_2 - c_1 z_1) &= 0 \\
\sum z_1 (y_1 - b_1 y_2 - c_1 z_1) &= 0
\end{aligned}
\tag{9.10}
$$

Substituting $y_2 = \hat{y}_2 + v_2$ we get

$$
\begin{aligned}
\sum \hat{y}_2 (y_1 - b_1 \hat{y}_2 - c_1 z_1) - b_1 \sum \hat{y}_2 v_2 &= 0 \\
\sum z_1 (y_1 - b_1 \hat{y}_2 - c_1 z_1) - b_1 \sum z_1 v_2 &= 0
\end{aligned}
\tag{9.11}
$$

But $\sum z_1 v_2 = 0$ and $\sum \hat{y}_2 v_2 = 0$ since $z_1$ and $\hat{y}_2$ are uncorrelated with $v_2$. Thus
equations (9.11) give

$$
\begin{aligned}
\sum \hat{y}_2 (y_1 - b_1 \hat{y}_2 - c_1 z_1) &= 0 \\
\sum z_1 (y_1 - b_1 \hat{y}_2 - c_1 z_1) &= 0
\end{aligned}
\tag{9.12}
$$

But these are the normal equations if we replace $y_2$ by $\hat{y}_2$ in (9.9) and estimate
the equation by OLS. This method of replacing the endogenous variables on
the right-hand side by their predicted values from the reduced form and estimating
the equation by OLS is called the two-stage least squares (2SLS)
method. The name arises from the fact that OLS is used in two stages:


Stage 1. Estimate the reduced-form equations by OLS and obtain the predicted
$\hat{y}$'s.

Stage 2. Replace the right-hand-side endogenous variables by $\hat{y}$'s and estimate
the equation by OLS.
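
The two stages can be sketched on simulated data for an equation of the form (9.9). Everything specific here is an assumption made for illustration: the true values $b_1 = 0.5$ and $c_1 = 2$, the way $y_2$ is generated directly from a reduced form in $z_1, \ldots, z_4$, and the helper names `solve` and `iv`. The sketch checks numerically that using $\hat{y}_2$ as a regressor (2SLS) and using it as an instrument (IV) give the same estimates.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[pivot] = M[pivot], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def iv(W, X, y):
    """Solve the moment equations W'(y - X b) = 0; taking W = X gives OLS."""
    return solve([[dot(w, x) for x in X] for w in W], [dot(w, y) for w in W])


rng = random.Random(1)
n = 5000
z = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]    # z1..z4
v = [rng.gauss(0, 1) for _ in range(n)]
u = [0.7 * vt + rng.gauss(0, 1) for vt in v]                   # u correlated with v
y2 = [z[0][t] + z[1][t] + z[2][t] + z[3][t] + v[t] for t in range(n)]
y1 = [0.5 * y2[t] + 2.0 * z[0][t] + u[t] for t in range(n)]    # assumed b1 = 0.5, c1 = 2

# Stage 1: reduced-form regression of y2 on all exogenous variables.
pi = iv(z, z, y2)
y2_hat = [sum(pi[k] * z[k][t] for k in range(4)) for t in range(n)]

# Stage 2: OLS with y2 replaced by its predicted value.
b_2sls = iv([y2_hat, z[0]], [y2_hat, z[0]], y1)
# Efficient IV: y2_hat and z1 used as instruments for y2 and z1.
b_iv = iv([y2_hat, z[0]], [y2, z[0]], y1)
# Plain OLS for comparison (inconsistent because y2 is correlated with u).
b_ols = iv([y2, z[0]], [y2, z[0]], y1)

print("2SLS:", b_2sls, " IV:", b_iv, " OLS:", b_ols)
```

In this design the 2SLS and IV coefficients agree up to rounding error, while the OLS coefficient of $y_2$ is pulled away from 0.5.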

Note that the estimates do not change even if we replace $y_1$ by $\hat{y}_1$ in equation
(9.9). Take the normal equations (9.12). Write

$$y_1 = \hat{y}_1 + v_1$$

where $v_1$ is again uncorrelated with each of $z_1$, $z_2$, $z_3$, $z_4$. Thus it is also uncorrelated
with $\hat{y}_1$ and $\hat{y}_2$, which are both linear functions of the z's. Now substitute
$y_1 = \hat{y}_1 + v_1$ in equations (9.12). We get

$$
\begin{aligned}
\sum \hat{y}_2 (\hat{y}_1 - b_1 \hat{y}_2 - c_1 z_1) + \sum \hat{y}_2 v_1 &= 0 \\
\sum z_1 (\hat{y}_1 - b_1 \hat{y}_2 - c_1 z_1) + \sum z_1 v_1 &= 0
\end{aligned}
$$

The last terms of these two equations are zero and the equations that remain
are the normal equations from the OLS estimation of the equation

$$\hat{y}_1 = b_1 \hat{y}_2 + c_1 z_1 + w$$

Thus in stage 2 of the 2SLS method we can replace all the endogenous variables
in the equation by their predicted values from the reduced forms and then
estimate the equation by OLS.

What difference does it make? The answer is that the estimated standard
errors from the second stage will be different because the dependent variable
is $\hat{y}_1$ instead of $y_1$. However, the estimated standard errors from the second
stage are the wrong ones anyway, as we will show presently. Thus it does not
matter whether we replace the endogenous variables on the right-hand side or
all the endogenous variables by $\hat{y}$'s in the second stage of the 2SLS method.

The preceding discussion has been in terms of a simple model, but the arguments
are general because all the $\hat{y}$'s are uncorrelated with the reduced-form
residuals v's. Since our discussion has been based on replacing $y$ by $\hat{y} + v$, the
arguments all go through for the general models.


Computing Standard Errors 

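We will now show how the standard errors we obtain from the second stage of
the 2SLS method are not the correct ones and how we can obtain the correct
standard errors. Consider the very simple model

$$y_1 = \beta y_2 + u \tag{9.13}$$

where $y_1$ and $y_2$ are endogenous variables. There are some (more than one)
exogenous variables in the system. We first estimate the reduced-form equation
for $y_2$ and write

$$y_2 = \hat{y}_2 + v_2$$

The IV estimator of $\beta$ is obtained by solving $\sum \hat{y}_2 (y_1 - \beta y_2) = 0$, or

$$\hat{\beta}_{IV} = \frac{\sum \hat{y}_2 y_1}{\sum \hat{y}_2 y_2}$$

The 2SLS estimator of $\beta$ is obtained by solving $\sum \hat{y}_2 (y_1 - \beta \hat{y}_2) = 0$, or

$$\hat{\beta}_{2SLS} = \frac{\sum \hat{y}_2 y_1}{\sum \hat{y}_2^2}$$

Since $\sum \hat{y}_2 y_2 = \sum \hat{y}_2 (\hat{y}_2 + v_2) = \sum \hat{y}_2^2$, the two estimators are identical and we
will drop the subscripts IV and 2SLS and just write $\hat{\beta}$.

$$\hat{\beta} = \frac{\sum \hat{y}_2 y_1}{\sum \hat{y}_2^2} = \frac{\sum \hat{y}_2 (\beta y_2 + u)}{\sum \hat{y}_2^2}$$

(Note: $\sum \hat{y}_2 y_2 = \sum \hat{y}_2^2$.) Hence

$$\hat{\beta} - \beta = \frac{(1/n) \sum \hat{y}_2 u}{(1/n) \sum \hat{y}_2^2}$$

where n is the sample size. Since u is independent of the exogenous variables
in the system and $\hat{y}_2$ is a linear function of the exogenous variables, we have

$$\operatorname{plim} \frac{1}{n} \sum \hat{y}_2 u = 0$$

If we assume that

$$\operatorname{plim} \frac{1}{n} \sum \hat{y}_2^2 \neq 0$$

we see that plim $\hat{\beta} = \beta$ and thus $\hat{\beta}$ is a consistent estimator for $\beta$.

To compute the asymptotic variance of $\hat{\beta}$, note that as $n \to \infty$, var($\hat{\beta}$) $\to$ 0.
Hence it is customary to consider $n$ var($\hat{\beta}$). We have

$$
\begin{aligned}
n \operatorname{var}(\hat{\beta}) &= n \operatorname{plim} (\hat{\beta} - \beta)^2 \\
&= n \operatorname{plim} \frac{[(1/n) \sum \hat{y}_2 u][(1/n) \sum \hat{y}_2 u]}{[(1/n) \sum \hat{y}_2^2]^2} \\
&= \operatorname{plim} \frac{(1/n)(\sum \hat{y}_2 u)(\sum \hat{y}_2 u)}{[(1/n) \sum \hat{y}_2^2]^2} \\
&= \frac{\sigma_u^2 \operatorname{plim} [(1/n) \sum \hat{y}_2^2]}{\{\operatorname{plim} [(1/n) \sum \hat{y}_2^2]\}^2} \\
&= \frac{\sigma_u^2}{\operatorname{plim} (1/n) \sum \hat{y}_2^2}
\end{aligned}
\tag{9.14}
$$

where $\sigma_u^2$ = var(u). In practice, we get an estimate of var($\hat{\beta}$) as
$\hat{\sigma}_u^2 / \sum \hat{y}_2^2$, where $\hat{\sigma}_u^2$ is a consistent estimator of $\sigma_u^2$. This is obtained by plugging
in the consistent estimator $\hat{\beta}$ of $\beta$ in (9.13) and computing the sum of squares
of residuals. Thus

$$\hat{\sigma}_u^2 = \frac{1}{n} \sum (y_1 - \hat{\beta} y_2)^2$$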
There is the question of whether we should use n as divisor or (n - 1). Since
the expression is asymptotic, the choice does not matter. Note that if we ignored
the fact that $\hat{y}_2$ is a random variable, then using (9.14) we would have
written

$$\operatorname{var}(\hat{\beta}) = \operatorname{var}\left(\frac{\sum \hat{y}_2 u}{\sum \hat{y}_2^2}\right) = \frac{\sigma_u^2}{\sum \hat{y}_2^2} \tag{9.15}$$

as in a simple regression model. In effect, this is the expression we use in
getting the asymptotic standard errors (after substituting $\hat{\sigma}_u^2$ for $\sigma_u^2$), but the correct
derivation will have to use probability limits.

Now what is wrong with the standard errors obtained at the second stage of
the 2SLS method? To see what is wrong, let us write equation (9.13) as

$$y_1 = \beta \hat{y}_2 + (u + \beta v_2) = \beta \hat{y}_2 + w \tag{9.16}$$

When we estimate this equation by OLS, the standard error of $\hat{\beta}$ is obtained from

$$\widehat{\operatorname{var}}(\hat{\beta}) = \frac{\hat{\sigma}_w^2}{\sum \hat{y}_2^2}$$

Comparing this with (9.15), we note that what we need is $\hat{\sigma}_u^2$ and what we get
from the second stage of 2SLS is $\hat{\sigma}_w^2$. The denominator is correct. It is the
estimate of the error variance that is wrong. We can correct this by multiplying
the standard errors obtained by $\hat{\sigma}_u / \hat{\sigma}_w$, where

$$\hat{\sigma}_u^2 = \frac{1}{n} \sum (y_1 - \hat{\beta} y_2)^2$$

$$\hat{\sigma}_w^2 = \frac{1}{n} \sum (y_1 - \hat{\beta} \hat{y}_2)^2$$

The latter is the expression we get from the OLS program at the second stage.
Most computer programs make this correction anyway, so when using the standard
programs we do not have to worry about this problem. The discussion
above is meant to show how the correct standard errors are obtained.
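
The correction can be sketched for the simple model (9.13) on simulated data. The true value $\beta = 1.5$, the two instruments `z1` and `z2`, and the error structure are hypothetical choices for illustration. The sketch computes the second-stage standard error (based on $\hat{\sigma}_w^2$), the correct one (based on $\hat{\sigma}_u^2$), and checks that rescaling the former by $\hat{\sigma}_u / \hat{\sigma}_w$ reproduces the latter.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(2)
n, beta = 5000, 1.5                                   # assumed sample size and true slope
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]
u = [0.8 * vt + rng.gauss(0, 1) for vt in v]          # u correlated with v
y2 = [a + b + c for a, b, c in zip(z1, z2, v)]        # reduced form for y2
y1 = [beta * a + e for a, e in zip(y2, u)]            # structural equation (9.13)

# Stage 1: regress y2 on z1 and z2 (2x2 normal equations solved by Cramer's rule).
s11, s12, s22 = dot(z1, z1), dot(z1, z2), dot(z2, z2)
t1, t2 = dot(z1, y2), dot(z2, y2)
det = s11 * s22 - s12 * s12
p1 = (t1 * s22 - t2 * s12) / det
p2 = (t2 * s11 - t1 * s12) / det
y2_hat = [p1 * a + p2 * b for a, b in zip(z1, z2)]

# Stage 2: regress y1 on y2_hat.
s_hat = dot(y2_hat, y2_hat)
b_hat = dot(y2_hat, y1) / s_hat

# Error variance estimates: residuals use y2 (correct) or y2_hat (second stage).
sigma_u2 = sum((a - b_hat * c) ** 2 for a, c in zip(y1, y2)) / n
sigma_w2 = sum((a - b_hat * c) ** 2 for a, c in zip(y1, y2_hat)) / n

se_stage2 = math.sqrt(sigma_w2 / s_hat)               # what the second-stage OLS reports
se_correct = math.sqrt(sigma_u2 / s_hat)              # based on (9.15)
se_rescaled = se_stage2 * math.sqrt(sigma_u2 / sigma_w2)

print(b_hat, se_stage2, se_correct)
```

In this particular design the second-stage standard error is too large, because var(w) exceeds var(u); in other designs it could be too small.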

The preceding results are also valid when there are exogenous variables in 
equation (9.13). We have considered a simple model for ease of exposition. 


Illustrative Example 

In Table 9.3 data are provided on commercial banks’ loans to business firms in
the United States for 1979-1984. The following demand-supply model has been
estimated:^4

Demand for loans by business firms:

$$Q_t = \beta_0 + \beta_1 R_t + \beta_2 RD_t + \beta_3 X_t + u_t$$

and supply by banks of commercial loans:

$$Q_t = \alpha_0 + \alpha_1 R_t + \alpha_2 RS_t + \alpha_3 y_t + v_t$$

where $Q_t$ = total commercial loans (billions of dollars)
$R_t$ = average prime rate charged by banks
$RS_t$ = 3-month Treasury bill rate (represents an alternative rate of return for banks)
$RD_t$ = AAA corporate bond rate (represents the price of alternative financing to firms)
$X_t$ = industrial production index (represents firms’ expectation about future economic activity)
$y_t$ = total bank deposits (represents a scale variable) (billions of dollars)

^4 The model postulated here is not necessarily the right model for the problem of analyzing the
commercial loan market. It is adequate for our purpose of a two-equation model with both
equations over-identified.

Both the equations are over-identified. So we chose to estimate them by 2SLS.
$R_t$ is expected to have a negative sign in the demand function and a positive
sign in the supply function. The coefficient of $RS_t$ is expected to be negative.
The coefficients of $RD_t$, $X_t$, and $y_t$ are expected to be positive. Both the OLS
and 2SLS estimates of the parameters had the expected signs. These estimates
are presented in Table 9.4. Note that the $R^2$ might increase or decrease when
we use 2SLS as compared to OLS.

There are only minor changes in the 2SLS estimates compared to the OLS
estimates for the demand function. As for the supply function, the only changes
we see are in the parameters $\alpha_1$ and $\alpha_2$ (coefficients of $R_t$ and $RS_t$). In this
example this is quite important because what this shows is that quantity supplied
is more responsive to changes in interest rates than is evidenced from the
OLS estimates.

If the $R^2$ values from the reduced-form equations are very high, the 2SLS
and OLS estimates will be almost identical. This is because the $\hat{y}$'s that we
substitute in the 2SLS estimation procedure are very close to the corresponding
y's when the $R^2$'s are very high. This is often the case in large econometric
models with a very large number of exogenous variables. An example of this
is the quarterly model of T. C. Liu.^5 Liu presents 2SLS and OLS estimates side
by side and for some equations they are identical to three decimals. Many studies,
however, do not report OLS and 2SLS estimates at the same time.


9.7 The Question of Normalization 


Going back to equations (9.8), we notice that the coefficient of $y_1$ in the first
equation and that of $y_2$ in the second equation are both unity. This is expressed
by saying that the first equation is normalized with respect to $y_1$ and the second

^5 T. C. Liu, “An Exploratory Quarterly Model of Effective Demand in the Post-war U.S. Economy,”
Econometrica, Vol. 31, July 1963.
Table 9.3 Data for U.S. Commercial Loan Market January 1979-December 1984
(Monthly)

N      Q       R       RD      X       RS      y
1     251.8   11.75    9.25   150.8    9.35    994.3
2     255.6   11.75    9.26   151.5    9.32   1002.5
3     259.8   11.75    9.37   152.0    9.48    994.0
4     264.7   11.75    9.38   153.0    9.46    997.4
5     268.8   11.75    9.50   150.8    9.61   1013.2
6     274.6   11.65    9.29   152.4    9.06   1015.6
7     276.9   11.54    9.20   152.6    9.24   1012.3
8     280.5   11.91    9.23   152.8    9.52   1020.9
9     288.1   12.90    9.44   151.6   10.26   1043.6
10    288.3   14.39   10.13   152.4   11.70   1062.6
11    287.9   15.55   10.76   152.4   11.79   1058.5
12    295.0   15.30   11.31   152.1   12.64   1076.3
13    295.1   15.25   11.86   152.2   13.50   1063.1
14    298.5   15.63   12.36   152.7   14.35   1070.0
15    301.7   18.31   12.96   152.6   15.20   1073.5
16    302.0   19.77   12.04   152.1   13.20   1101.1
17    298.1   16.57   10.99   148.3    8.58   1097.1
18    297.8   12.63   10.58   144.0    7.07   1088.7
19    301.2   11.48   11.07   141.5    8.06   1099.9
20    304.7   11.12   11.64   140.4    9.13   1111.1
21    308.1   12.23   12.02   141.8   10.27   1122.2
22    315.6   13.79   12.31   144.1   11.62   1161.4
23    323.1   16.06   11.94   146.9   13.73   1200.6
24    330.6   20.35   13.21   149.4   15.49   1239.9
25    330.9   20.16   12.81   151.0   15.02   1223.5
26    331.3   19.43   13.35   151.7   14.79   1207.1
27    331.6   18.04   13.33   151.5   13.36   1190.6
28    336.2   17.15   13.88   152.1   13.69   1206.0
29    340.9   19.61   14.32   151.9   16.30   1221.4
30    345.5   20.03   13.75   152.7   14.73   1236.7
31    350.3   20.39   14.38   152.9   14.95   1221.5
32    354.2   20.50   14.89   153.9   15.51   1250.3
33    366.3   20.08   15.49   153.6   14.70   1293.7
34    361.7   18.45   15.40   151.6   13.54   1224.6
35    365.5   16.84   14.22   149.1   10.86   1254.1
36    361.4   15.75   14.23   146.3   10.85   1288.7
37    359.8   15.75   15.18   143.4   12.28   1251.5
38    364.6   16.56   15.27   140.7   13.48   1258.3
39    372.4   16.50   14.58   142.7   12.68   1295.0
40    374.7   16.50   14.46   141.5   12.70   1272.1
41    379.3   16.50   14.26   140.2   12.09   1286.1
42    386.7   16.50   14.81   139.2   12.47   1325.8
43    384.4   16.26   14.61   138.7   11.35   1307.3
44    384.5   14.39   13.71   138.8    8.68   1321.7
45    395.0   13.50   12.94   138.4    7.92   1335.5
46    393.7   12.52   12.12   137.3    7.71   1345.2
47    398.9   11.85   11.68   135.7    8.07   1358.1
48    395.3   11.50   11.83   134.9    7.94   1409.7
49    392.4   11.16   11.79   135.2    7.86   1385.4
50    392.3   10.98   12.01   137.4    8.11   1412.6
51    395.9   10.50   11.73   138.1    8.35   1419.5
52    393.5   10.50   11.51   140.0    8.21   1411.0
53    391.7   10.50   11.46   142.6    8.19   1413.1
54    395.3   10.50   11.74   144.4    8.79   1443.8
55    397.7   10.50   12.15   146.4    9.08   1438.1
56    400.6   10.89   12.51   149.7    9.34   1461.4
57    402.7   11.00   12.37   151.8    9.00   1448.9
58    405.3   11.00   12.25   153.8    8.64   1459.0
59    412.0   11.00   12.41   155.0    8.76   1499.4
60    420.1   11.00   12.57   155.3    9.00   1508.9
61    424.4   11.00   12.20   156.2    8.90   1504.1
62    428.8   11.00   12.08   158.5    9.09   1499.3
63    433.1   11.21   12.57   160.0    9.52   1494.5
64    439.7   11.93   12.81   160.8    9.69   1501.5
65    447.3   12.39   13.28   162.1    9.83   1541.3
66    452.9   12.60   13.55   162.8    9.87   1532.9
67    454.4   13.00   13.44   164.4   101.2   1535.5
68    455.2   13.00   12.87   165.9   10.47   1539.0
69    459.9   12.97   12.66   166.0   10.37   1549.9
70    467.7   12.58   12.63   165.0    9.74   1578.9
71    468.7   11.77   12.29   164.4    8.61   1578.2
72    476.8   11.06   12.13   164.8    8.06   1631.2

Source: Several issues of the Federal Reserve Bulletin. I would like to thank Walter Mayer
for providing me with the data.
Table 9.4 OLS and 2SLS Estimates for the Demand-Supply Model of the
Commercial Loan Market

                      OLS                      2SLS^a
           Coefficient   t-Ratio     Coefficient   t-Ratio

Demand function
β0           -203.70       -2.9        -210.53       -2.8
β1            -15.99      -12.0         -20.19      -12.6
β2              2.29        5.4           2.34        5.2
β3             36.07       14.2          40.76       14.4
R^2             0.7804                    0.7485

Supply function
α0            -77.41       -6.9         -87.97       -6.3
α1              2.41        2.9           6.90        3.6
α2             -1.89       -1.8          -7.08       -3.1
α3              0.33       51.3           0.33       42.9
R^2             0.9768                    0.9666

^a For the 2SLS, the “t-ratio” is coefficient/asymptotic SE.


equation with respect to $y_2$. Sometimes this is taken to mean that $y_1$ is the
“dependent” variable in the first equation and $y_2$ is the “dependent” variable
in the second equation. This is particularly so in the 2SLS method. Strictly
speaking, this goes contrary to the spirit of simultaneous equation models because
by definition $y_1$ and $y_2$ are jointly determined and we cannot label any
single variable as the dependent variable in any equation. Of course, we have
to assume the coefficient of one of the variables as 1 (i.e., normalize with respect
to one variable). But the method of estimation should be such that it
should not matter which variable we choose for normalization. The early methods
of estimation like full-information maximum likelihood (FIML) and
limited-information maximum likelihood (LIML) satisfied this requirement. But
the more popular methods like 2SLS do not and are, strictly speaking, not
in the spirit of simultaneous equations estimation. A discussion of FIML is
beyond the scope of this book, but we discuss LIML in Section 9.8.

In practice, normalization is determined by the way the economic theories
are formulated. For instance, in a macro model of income determination,
although consumption $C$ and income $y$ are considered as jointly determined, we
write the consumption function as

$$C = \alpha + \beta y + u$$

and not as

$$y = \alpha' + \beta' C + u'$$

This is because, at the level of individuals, consumption $C$ is determined by
income $y$. At the back of our minds there is a causal relationship from $y$ to $C$
at the micro level, and we carry this to the macro level as well.

In an exactly identified system normalization does not matter. To see this,
consider the first equation in (9.8). We saw earlier that we use $z_1$, $z_2$, $z_3$ as
instrumental variables. Since there is no problem of weighting these
instrumental variables, it does not matter how the equation is normalized. On
the other hand, for the second equation in (9.8) we have the choice between
$z_1$ and $z_2$ as instruments and the optimum instrumental variable is a weighted
average of $z_1$ and $z_2$. We saw that these weights were obtained from the
reduced-form equation for $y_1$. If, however, the equation is normalized with
respect to $y_1$ and written as

$$y_1 = b_2' y_2 + c_3' z_3 + u_2'$$

then the weights for $z_1$ and $z_2$ are obtained from the reduced-form equation for
$y_2$. Thus 2SLS and IV estimators are different for different normalizations in
over-identified systems.


*9.8 The Limited-Information Maximum 
Likelihood Method 


The LIML method, also known as the least variance ratio (LVR) method, is the
first of the single-equation methods suggested for simultaneous-equations
models. It was suggested by Anderson and Rubin 6 in 1949 and was popular
until the advent of 2SLS, introduced by Theil 7 in the late 1950s. The LIML
method is computationally more cumbersome, but for the simple models we
are considering, it is easy to use.

Consider again the equations in (9.8). We can write the first equation as

$$y_1^* = y_1 - b_1 y_2 = c_1 z_1 + c_2 z_2 + u_1 \tag{9.17}$$

For each $b_1$ we can construct a $y_1^*$. Consider a regression of $y_1^*$ on $z_1$ and
$z_2$ only and compute the residual sum of squares (which will be a function of
$b_1$). Call it $\text{RSS}_1$. Now consider a regression of $y_1^*$ on all the exogenous
variables $z_1$, $z_2$, $z_3$ and compute the residual sum of squares. Call it
$\text{RSS}_2$. What equation (9.17) says is that $z_3$ is not important in determining
$y_1^*$. Thus the extra reduction in RSS by adding $z_3$ should be minimal. The LIML
or LVR method says that we should choose $b_1$ so that
$(\text{RSS}_1 - \text{RSS}_2)/\text{RSS}_1$ or, equivalently, $\text{RSS}_1/\text{RSS}_2$ is minimized. After
$b_1$ is determined, the estimates of $c_1$ and $c_2$ are obtained by regressing
$y_1^*$ on $z_1$ and $z_2$. The procedure is similar for the second equation in (9.8).

6 T. W. Anderson and H. Rubin, “Estimation of the Parameters of a Single Equation in a
Complete System of Stochastic Equations,” Annals of Mathematical Statistics, Vol. 20, No. 1,
March 1949.

7 H. Theil, Economic Forecasts and Policy (Amsterdam: North-Holland, 1958).

There are some important relationships between the LIML and 2SLS methods.
(We will omit the proofs, which are beyond the scope of this book.)

1. The 2SLS method can be shown to minimize the difference
$(\text{RSS}_1 - \text{RSS}_2)$, whereas the LIML minimizes the ratio $(\text{RSS}_1/\text{RSS}_2)$.

2. If the equation under consideration is exactly identified, then 2SLS and
LIML give identical estimates.

3. The LIML estimates are invariant to normalization.

4. The asymptotic variances and covariances of the LIML estimates are the
same as those of the 2SLS estimates. However, the standard errors will
differ because the error variance $\sigma_u^2$ is estimated from different estimates
of the structural parameters.

5. In the computation of LIML estimates we use the variances and
covariances among the endogenous variables as well. But the 2SLS estimates
do not depend on this information. For instance, in the 2SLS estimation
of the first equation in (9.8), we regress $y_1$ on $\hat{y}_2$, $z_1$, and $z_2$. Since
$\hat{y}_2$ is a linear function of the $z$'s we do not make any use of
$\text{cov}(y_1, y_2)$. This covariance is used only in the computation of $\hat{\sigma}_u^2$.
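
The relationships listed above can be checked numerically. The following is a minimal sketch, not the book's own computation: it assumes a simulated two-equation system of the same shape as (9.8), with illustrative parameter values ($b_1 = 0.5$, $b_2 = -0.4$, all $c$'s equal to 1) and hypothetical helper names `resid`, `tsls`, `liml`, and `simulate`. The variance ratio is minimized through a small symmetric eigenvalue problem rather than by searching over $b_1$, which gives the same minimizer.

```python
import numpy as np

def resid(A, X):
    """Residuals from an OLS regression of A (vector or matrix) on X."""
    coef, *_ = np.linalg.lstsq(X, A, rcond=None)
    return A - X @ coef

def tsls(y, y_endog, Z_incl, Z_all):
    """2SLS coefficient on the single included endogenous regressor."""
    y_hat = y_endog - resid(y_endog, Z_all)      # first-stage fitted values
    X = np.column_stack([y_hat, Z_incl])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0]

def liml(y, y_endog, Z_incl, Z_all):
    """LIML coefficient and the minimized ratio RSS_1/RSS_2."""
    Y = np.column_stack([y, y_endog])
    R1, R2 = resid(Y, Z_incl), resid(Y, Z_all)
    W1, W2 = R1.T @ R1, R2.T @ R2
    # Minimize g'W1 g / g'W2 g via a Cholesky change of basis.
    L_inv = np.linalg.inv(np.linalg.cholesky(W2))
    vals, vecs = np.linalg.eigh(L_inv @ W1 @ L_inv.T)
    g = L_inv.T @ vecs[:, 0]          # eigh returns eigenvalues in ascending order
    return -g[1] / g[0], vals[0]      # g is proportional to (1, -b)

def simulate(n, b1=0.5, b2=-0.4, seed=0):
    """Assumed data: y1 = b1*y2 + z1 + z2 + u1,  y2 = b2*y1 + z3 + u2."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 3))
    u1 = rng.normal(size=n)
    u2 = 0.5 * u1 + rng.normal(size=n)
    y1 = (z[:, 0] + z[:, 1] + b1 * z[:, 2] + u1 + b1 * u2) / (1 - b1 * b2)
    y2 = b2 * y1 + z[:, 2] + u2
    return y1, y2, z

y1, y2, z = simulate(500)

# First equation (exactly identified): 2SLS and LIML coincide, ratio equals 1.
b1_2sls = tsls(y1, y2, z[:, :2], z)
b1_liml, ratio1 = liml(y1, y2, z[:, :2], z)

# Second equation (over-identified), normalized on y2 and then on y1.
b2_2sls, (b2_liml, ratio2) = tsls(y2, y1, z[:, 2:], z), liml(y2, y1, z[:, 2:], z)
r2_2sls, (r2_liml, _) = tsls(y1, y2, z[:, 2:], z), liml(y1, y2, z[:, 2:], z)

print("eq. 1:", b1_2sls, b1_liml, ratio1)
print("eq. 2 LIML, product of the two normalizations:", b2_liml * r2_liml)
print("eq. 2 2SLS, product of the two normalizations:", b2_2sls * r2_2sls)
```

Under invariance to normalization the two LIML coefficients are exact reciprocals, so their product is 1; the 2SLS product generally is not, which is the point made in item 3.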

Illustrative Example 

Consider the demand and supply model of the Australian wine industry
discussed in Section 9.5. Since the demand function is exactly identified, the
2SLS and LIML estimates would be identical and they are also identical to the
instrumental variable estimates (using $S$ as an instrument) presented earlier in
Section 9.5.

As for the supply function, since it is over-identified, the 2SLS and LIML
estimates will be different. The following are the 2SLS and LIML estimates of
the parameters of the supply function (as computed from the SAS program).
All variables are in logs:


                         2SLS                               LIML
Variable     Coefficient    SE     t-Ratio     Coefficient    SE     t-Ratio

Intercept      -16.820     1.080   -15.57        -16.849     1.087   -15.49
$p_w$            2.616     0.331     7.89          2.627     0.334     7.86
$S$              1.188     0.190     6.24          1.183     0.191     6.18

$R^2$ (as computed by the SAS program): 0.9548 (2SLS), 0.9544 (LIML)


In this particular example the LIML and 2SLS estimates differ very little.
This is not, however, the usual experience in studies of over-identified
models.

*9.9 On the Use of OLS in the Estimation
of Simultaneous-Equations Models


Although we know that the simultaneity problem results in inconsistent
estimators of the parameters when the structural equations are estimated by
ordinary least squares (OLS), this does not mean that OLS estimation of
simultaneous-equations models is useless. In some instances we may be able to
say something about the direction of the (large-sample) bias, and this would be
useful information. Also, if an equation is under-identified, it does not
necessarily mean that nothing can be said about the parameters in that equation.

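Consider the demand and supply model:

$$q_t = \beta p_t + u_t \qquad \text{demand function}$$

$$q_t = \alpha p_t + v_t \qquad \text{supply function}$$

$q_t$ and $p_t$ are in log form and are measured in deviations from their means.
Thus $\beta$ and $\alpha$ are the price elasticities of demand and supply,
respectively. Let $\text{var}(u_t) = \sigma_u^2$, $\text{var}(v_t) = \sigma_v^2$, and
$\text{cov}(u_t, v_t) = \sigma_{uv}$. The OLS estimator of $\beta$ is

$$\hat{\beta} = \frac{\sum p_t q_t}{\sum p_t^2} = \beta + \frac{\sum p_t u_t}{\sum p_t^2} = \beta + \frac{(1/n)\sum p_t u_t}{(1/n)\sum p_t^2}$$

where $n$ is the sample size. Now

$$\beta p_t + u_t = \alpha p_t + v_t$$

or

$$p_t = \frac{v_t - u_t}{\beta - \alpha}$$

Hence we have

$$\operatorname{plim} \frac{1}{n}\sum p_t u_t = \text{cov}(p_t, u_t) = \frac{\sigma_{uv} - \sigma_u^2}{\beta - \alpha}$$

and

$$\operatorname{plim} \frac{1}{n}\sum p_t^2 = \text{var}(p_t) = \frac{\sigma_u^2 + \sigma_v^2 - 2\sigma_{uv}}{(\beta - \alpha)^2}$$

Thus

$$\operatorname{plim} \hat{\beta} = \beta + (\beta - \alpha)\,\frac{\sigma_{uv} - \sigma_u^2}{\sigma_u^2 + \sigma_v^2 - 2\sigma_{uv}}$$

This expression is not very useful, but if $\sigma_{uv} = 0$ (i.e., the shifts in
the demand and supply functions are unrelated), we get

$$\operatorname{plim} \hat{\beta} = \beta + (\beta - \alpha)\,\frac{-\sigma_u^2}{\sigma_u^2 + \sigma_v^2}$$

Now $\beta$ is expected to be negative and $\alpha$ positive. Thus the bias term is
expected to be positive. Hence if we find a price elasticity of demand of -0.8,
the true price elasticity is < -0.8.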

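We can show by similar reasoning that if we estimate the supply function
by OLS we get

$$\hat{\alpha} = \frac{\sum p_t q_t}{\sum p_t^2}, \qquad \operatorname{plim} \hat{\alpha} = \alpha + (\beta - \alpha)\,\frac{\sigma_v^2 - \sigma_{uv}}{\sigma_u^2 + \sigma_v^2 - 2\sigma_{uv}}$$

and if $\sigma_{uv} = 0$, then

$$\operatorname{plim} \hat{\alpha} = \alpha + (\beta - \alpha)\,\frac{\sigma_v^2}{\sigma_u^2 + \sigma_v^2}$$

The bias is now negative since $(\beta - \alpha)$ is expected to be negative. Again,
suppose that the supply elasticity is estimated to be 0.6; then we know that the
true supply elasticity is > 0.6.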

In practice, when we regress $q$ on $p$ we do not know whether we are
estimating the demand function or the supply function. However, if the
regression coefficient is, say, +0.3, we know that the supply elasticity is
0.3 + a positive number and the demand elasticity is 0.3 + a negative number.
Thus the practically useful conclusion is that the supply elasticity is > 0.3.
On the other hand, if the regression coefficient is -0.9, we know that the
supply elasticity is -0.9 + a positive number and the demand elasticity is
-0.9 + a negative number. Thus the practically useful conclusion is that the
demand elasticity is < -0.9 (or greater than 0.9 in absolute value).

In the example above we also note that

$$\operatorname{plim}\left(\frac{\sum p_t q_t}{\sum p_t^2}\right) \to \beta \ \text{ as } \ \sigma_u^2 \to 0, \qquad \to \alpha \ \text{ as } \ \sigma_v^2 \to 0$$

Since $\sigma_u^2$ is the variance of the random shift in the demand function,
$\sigma_u^2 \to 0$ means that the demand function does not shift or is stable. Thus
the regression of $q$ on $p$ will estimate the demand elasticity if the demand
function is stable and the supply elasticity if the supply function is stable.
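
A short simulation can make the direction of the bias concrete. This is only a sketch under assumed values ($\beta = -1$, $\alpha = 0.5$, unit error variances, uncorrelated shifts); the function names `plim_ols_slope` and `simulate_ols_slope` are illustrative and not part of the text.

```python
import random

def plim_ols_slope(beta, alpha, var_u, var_v, cov_uv=0.0):
    """Large-sample limit of the OLS slope from regressing q on p."""
    return beta + (beta - alpha) * (cov_uv - var_u) / (var_u + var_v - 2.0 * cov_uv)

def simulate_ols_slope(beta, alpha, sd_u, sd_v, n, seed=0):
    """Generate equilibrium (p, q) pairs and return sum(p*q) / sum(p*p)."""
    rng = random.Random(seed)
    spq = spp = 0.0
    for _ in range(n):
        u = rng.gauss(0.0, sd_u)         # demand shift
        v = rng.gauss(0.0, sd_v)         # supply shift
        p = (v - u) / (beta - alpha)     # price that equates demand and supply
        q = beta * p + u                 # quantity read off the demand curve
        spq += p * q
        spp += p * p
    return spq / spp

beta, alpha = -1.0, 0.5
limit = plim_ols_slope(beta, alpha, var_u=1.0, var_v=1.0)
estimate = simulate_ols_slope(beta, alpha, sd_u=1.0, sd_v=1.0, n=200_000)
print(limit, estimate)

# Limiting cases: a stable demand curve recovers beta, a stable supply curve alpha.
print(plim_ols_slope(beta, alpha, var_u=0.0, var_v=1.0))
print(plim_ols_slope(beta, alpha, var_u=1.0, var_v=0.0))
```

With equal variances the limit lies between the two elasticities, above $\beta$ and below $\alpha$, which is the bounding argument used in the text.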

Figure 9.2 gives the case of a supply curve shifting and the demand curve
not shifting. What we observe in practice are the points of intersection of the
demand curve and the supply curve. As can be seen, the locus of these points
of intersection traces out the demand curve. In Figure 9.3, where the demand
curve shifts and the supply curve does not, this locus determines the supply
curve. 8

Figure 9.2. Stable demand and shifting supply.

Working’s Concept of Identification 

After concluding that if the demand curve did not shift much, but the supply
curve did, then the locus of the points of intersection would come close to
tracing a demand curve, Working added that by “correcting” for the influence
of an additional important determining variable (such as income) in the demand
curve, we could reduce its shifts and hence get a better approximation of the
price elasticity of demand. If we add income to the demand function, we get
the equation system

$$q_t = \beta p_t + \gamma y_t + u_t \qquad \text{demand function}$$

$$q_t = \alpha p_t + v_t \qquad \text{supply function}$$

What Working said was that by introducing $y_t$ in the demand function, we can
get a better estimate of $\beta$. What he had in mind was that the introduction of
$y_t$ reduced the variance of $u_t$.

If we use the results on identification discussed in Sections 9.3 and 9.4, we
come to the conclusion that the supply equation is identified (it has one missing
variable) but the demand function is not. This is exactly the opposite of what
Working said. The problem here is that the concept of identification we have
discussed is in terms of our ability to get consistent estimates. What Working
was concerned about is the least squares bias. His argument is that if $\sigma_u^2$ is
somehow reduced, we get good estimates by OLS (i.e., the bias would be
negligible). Consider two estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ for $\beta$. Suppose that

$$\operatorname{plim} \hat{\beta}_1 = \beta \quad \text{and} \quad \operatorname{plim} \hat{\beta}_2 = 0.999\beta$$

then $\hat{\beta}_1$ is consistent but $\hat{\beta}_2$ is not. But for all practical purposes
$\hat{\beta}_2$ is also consistent.

8 This was the main conclusion of the article by E. J. Working, “What Do Statistical Demand
Curves Show?” Quarterly Journal of Economics, February 1927.

Figure 9.3. Stable supply and shifting demand.

Suppose that we include all the relevant explanatory variables in the demand
function so that $\sigma_u^2$ is very small, whereas the supply function includes only
price as the “explanatory” variable. Then, even if the demand equation is not
identified by the rules we discussed earlier, we are still justified in estimating
the equation by OLS. On the other hand, even if the supply function is
over-identified, if $\sigma_v^2$ is very large, we get a poor estimate of the supply
elasticity $\alpha$. We can get a consistent estimator for $\alpha$, but its variance will
be very high.

In summary, it is not true that if an equation is under-identified, we have to
give up all hope of estimating it. Nor does it follow that an equation that is
over-identified can be better estimated than one that is exactly identified or
under-identified. Further discussion of the demand-supply model considered
here can be found in Maddala 9 and Leamer. 10

9 G. S. Maddala, Econometrics (New York: McGraw-Hill, 1977), pp. 244-249.

10 E. E. Leamer, “Is It a Demand Curve or Is It a Supply Curve: Partial Identification Through
Inequality Constraints,” The Review of Economics and Statistics, Vol. 63, No. 3, August 1981,
pp. 319-327.

In simultaneous equations models the classification of variables as
endogenous and exogenous is quite arbitrary. Often, current values of some
variables are treated as endogenous and lagged values of these same variables
are treated as exogenous even when the current and lagged values are very
highly correlated. In addition, some equations are regarded as identified (over-
or exactly) and others as not identified merely by looking at how many variables
are missing from the equation, and the under-identified equations are regarded
as nonestimable. The discussion above illustrates that least squares estimation
of under-identified equations can still be worthwhile. Also, instead of simply
counting the number of exogenous variables, one should also look at how high
their intercorrelations are.


Recursive Systems 

Not all simultaneous equations models give biased estimates when estimated
by OLS. An important class of models for which OLS estimation is valid is
that of recursive models. These are models in which the errors from the
different equations are independent and the coefficients of the endogenous
variables show a triangular pattern. For example, consider a model with three
endogenous variables, $y_1$, $y_2$, and $y_3$, and three exogenous variables, $z_1$, $z_2$,
and $z_3$, which has the following structure:

$$y_1 + \beta_{12} y_2 + \beta_{13} y_3 + \alpha_1 z_1 = u_1$$

$$y_2 + \beta_{23} y_3 + \alpha_2 z_2 = u_2$$

$$y_3 + \alpha_3 z_3 = u_3$$

where $u_1$, $u_2$, $u_3$ are independent. The coefficients of the endogenous
variables are

$$\begin{array}{ccc} 1 & \beta_{12} & \beta_{13} \\ & 1 & \beta_{23} \\ & & 1 \end{array}$$

which form a triangular structure. In such systems, each equation can be
estimated by ordinary least squares. Suppose that in the example above $u_2$ and
$u_3$ are correlated but they are independent of $u_1$; then the model is said to be
block-recursive. The second and third equations have to be estimated jointly,
but the first equation can be estimated by OLS. Note that $y_2$ and $y_3$ depend on
$u_2$ and $u_3$ only and hence are independent of the error term $u_1$.
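
The role of the independence assumption can be illustrated with simulated data. The sketch below assumes arbitrary coefficient values and standard normal $z$'s and errors; `simulate_recursive` and its `rho` argument (the correlation between $u_1$ and $u_2$) are hypothetical names introduced only for this illustration.

```python
import numpy as np

def simulate_recursive(n, rho, seed=0):
    """OLS on the first equation of a triangular system.

    rho is the correlation between u1 and u2; rho = 0 gives a recursive system.
    Returns (OLS coefficients on y2, y3, z1; their true values).
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 3))
    e = rng.normal(size=(n, 3))
    u1 = e[:, 0]
    u2 = rho * e[:, 0] + np.sqrt(1.0 - rho**2) * e[:, 1]
    u3 = e[:, 2]
    b12, b13, b23 = 0.5, -0.3, 0.8       # assumed structural coefficients
    a1, a2, a3 = 1.0, 1.0, 1.0
    y3 = -a3 * z[:, 2] + u3                          # third equation
    y2 = -b23 * y3 - a2 * z[:, 1] + u2               # second equation
    y1 = -b12 * y2 - b13 * y3 - a1 * z[:, 0] + u1    # first equation
    X = np.column_stack([y2, y3, z[:, 0]])
    coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
    return coef, np.array([-b12, -b13, -a1])

coef_rec, truth = simulate_recursive(20_000, rho=0.0)
coef_dep, _ = simulate_recursive(20_000, rho=0.8)
print("independent errors:", coef_rec, "truth:", truth)
print("u1 and u2 correlated:", coef_dep)
```

With `rho = 0` the OLS estimates settle near the true values. With correlated $u_1$ and $u_2$ the coefficient pattern is still triangular, but $y_2$ is no longer independent of $u_1$ and the estimate on $y_2$ is visibly off.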

Estimation of Cobb-Douglas Production Functions

In the estimation of production functions, output, labor input, and capital input
are usually considered the endogenous variables and wage rate and price of
capital as the exogenous variables.

Suppose that we use the Cobb-Douglas production function

$$X_t = A L_t^{\alpha} K_t^{\beta} e^{u_{1t}} \tag{9.18}$$

where $u_{1t}$ is the error term; then taking logs we can write it as

$$y_{1t} - \alpha y_{2t} - \beta y_{3t} = c_1 + u_{1t} \tag{9.19}$$

where $y_{1t} = \log X_t$, $y_{2t} = \log L_t$, $y_{3t} = \log K_t$, and $c_1 = \log A$.

Since $y_{2t}$ and $y_{3t}$ are endogenous variables, we cannot estimate this equation
by OLS. We have to add the equations for $y_{2t}$ and $y_{3t}$, and these are obtained
from the marginal productivity conditions. However, Zellner, Kmenta, and
Dreze 11 argue that in this case we can estimate equation (9.19) by OLS. Their
argument is that since output given by (9.18) is stochastic, the firm should be
maximizing expected profits (not profits). If $u_{1t} \sim IN(0, \sigma^2)$, then

$$E(e^{u_{1t}}) = \exp\left(\tfrac{1}{2}\sigma^2\right)$$

Hence

$$E(X_t) = A L_t^{\alpha} K_t^{\beta} \exp\left(\tfrac{1}{2}\sigma^2\right)$$

Expected profits $= R_t = p_t E(X_t) - w_t L_t - r_t K_t$, where $p$ is the price of
output, $w$ the price of labor input, and $r$ the price of capital input.
Maximization of expected profits gives us the following conditions:

$$\frac{\partial R_t}{\partial L_t} = 0 \;\Rightarrow\; \frac{X_t}{L_t} = \frac{w_t}{\alpha p_t} \exp\left(u_{1t} - \tfrac{1}{2}\sigma^2\right)$$

$$\frac{\partial R_t}{\partial K_t} = 0 \;\Rightarrow\; \frac{X_t}{K_t} = \frac{r_t}{\beta p_t} \exp\left(u_{1t} - \tfrac{1}{2}\sigma^2\right)$$

Taking logs and adding error terms $u_{2t}$ and $u_{3t}$ to these equations, we get (the
errors depict errors in maximization of expected profits)

$$y_{1t} - y_{2t} = c_{2t} + u_{1t} + u_{2t}$$

$$y_{1t} - y_{3t} = c_{3t} + u_{1t} + u_{3t} \tag{9.20}$$

where

$$c_{2t} = \log\left(\frac{w_t}{\alpha p_t}\right) - \tfrac{1}{2}\sigma^2, \qquad c_{3t} = \log\left(\frac{r_t}{\beta p_t}\right) - \tfrac{1}{2}\sigma^2$$

Now we can solve equations (9.19) and (9.20) to get the reduced forms for $y_{1t}$,
$y_{2t}$, $y_{3t}$. We can easily see that when we substitute for $y_{1t}$ from equation
(9.19) into equations (9.20), the $u_{1t}$ term cancels. Thus $y_{2t}$ and $y_{3t}$ involve
only the error terms $u_{2t}$ and $u_{3t}$.

11 A. Zellner, J. Kmenta, and J. Dreze, “Specification and Estimation of Cobb-Douglas
Production Function Models,” Econometrica, October 1966, pp. 786-795.

It is reasonable to assume that $u_{1t}$ are independent of $u_{2t}$ and $u_{3t}$ because
$u_{1t}$ are errors due to “acts of nature” like weather and $u_{2t}$ and $u_{3t}$ are “human
errors.” Under these assumptions $y_{2t}$ and $y_{3t}$ (which depend on $u_{2t}$ and $u_{3t}$
only) are independent of $u_{1t}$. Hence OLS estimation of the production function
(9.19) yields consistent estimates of the parameters $\alpha$ and $\beta$. We thus regress
$y_{1t}$ on $y_{2t}$ and $y_{3t}$.

Here is an example where OLS involves no simultaneity bias. The model is not
recursive either, but from the peculiar way the error terms enter the equations
we can show that in the first equation the included endogenous variables are
independent of the error term in that equation.
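
The cancellation of the production-function error can be written out explicitly. The following worked derivation is a sketch in the notation of (9.19) and (9.20); the shorthand $D = 1 - \alpha - \beta$ is introduced here for convenience, and the solution assumes $\alpha + \beta \neq 1$.

```latex
\begin{align*}
% Substitute y_{1t} = c_1 + \alpha y_{2t} + \beta y_{3t} + u_{1t} into (9.20); u_{1t} drops out:
(\alpha - 1)\, y_{2t} + \beta\, y_{3t} &= c_{2t} - c_1 + u_{2t} \\
\alpha\, y_{2t} + (\beta - 1)\, y_{3t} &= c_{3t} - c_1 + u_{3t} \\[4pt]
% Solve the two equations, with D = (\alpha-1)(\beta-1) - \alpha\beta = 1 - \alpha - \beta:
y_{2t} &= \frac{(\beta - 1)(c_{2t} - c_1 + u_{2t}) - \beta\,(c_{3t} - c_1 + u_{3t})}{D} \\
y_{3t} &= \frac{(\alpha - 1)(c_{3t} - c_1 + u_{3t}) - \alpha\,(c_{2t} - c_1 + u_{2t})}{D}
\end{align*}
```

Neither reduced form contains $u_{1t}$, so independence of $u_{1t}$ from $(u_{2t}, u_{3t})$ carries over to independence of $u_{1t}$ from the regressors $y_{2t}$ and $y_{3t}$.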

*9.10 Exogeneity and Causality 


The approach to simultaneous equations models that we have discussed until
now is called the Cowles Foundation approach. Its name derives from the fact
that it was developed during the late 1940s and early 1950s by the
econometricians at the Cowles Foundation at the University of Chicago. The basic
premise of this approach is that the data are assumed to have been generated by
a system of simultaneous equations. The classification of variables into
“endogenous” and “exogenous,” and the causal structure of the model, are both
given a priori and are untestable. The main emphasis is on the estimation of the
unknown parameters, for which the Cowles Foundation devised several methods
(limited-information and full-information methods).

This approach has, in recent years, been criticized on several grounds: 

1. The classification of variables into endogenous and exogenous is
sometimes arbitrary.

2. There are usually many variables that should be included in the equation
that are excluded to achieve identification. This argument was made by
T. C. Liu 12 in 1960 but did not receive much attention. It is known as the
Liu critique.

3. One of the main purposes of simultaneous equations estimation is to
forecast the effect of changes in the exogenous variables on the endogenous
variables. However, if the exogenous variables are changed and
profit-maximizing agents see the change coming, they would modify their
behavior accordingly. Thus the coefficients in the simultaneous equations
models cannot be assumed to be independent of changes in the exogenous
variables. This is now called the Lucas critique. 13

12 T. C. Liu, “Under-Identification, Structural Estimation, and Forecasting,” Econometrica,
Vol. 28, 1960, pp. 855-865.

13 R. E. Lucas, “Econometric Policy Evaluation: A Critique,” in Karl L. Brunner (ed.), The
Phillips Curve and Labor Markets (supplement to the Journal of Monetary Economics), 1976,
pp. 19-46.

One solution to the Lucas critique is to make the coefficients of the
simultaneous equations system depend on the exogenous policy variables. Since
this makes the model a varying parameter model, which is beyond the scope of
this book, we will not pursue it here. 14

Leamer 15 suggests redefining the concept of exogeneity. He suggests defining
the variable $x$ as exogenous if the conditional distribution of $y$ given $x$ is
invariant to modifications that alter the process generating $x$. What this says is
that a variable is defined as exogenous if the Lucas critique does not apply to
it. It is, however, not clear whether such a redefinition solves the problem
raised by Lucas.

There are two concepts of exogeneity that are usually distinguished: 

1. Predeterminedness. A variable is predetermined in a particular equation
if it is independent of the contemporaneous and future errors in that
equation.

2. Strict exogeneity. A variable is strictly exogenous if it is independent of
the contemporaneous, future, and past errors in the relevant equation.

To explain these concepts we have to consider a model with lagged variables.
Consider 16

$$y_t = \alpha_1 x_t + \beta_{11} y_{t-1} + \beta_{12} x_{t-1} + u_{1t}$$

$$x_t = \alpha_2 y_t + \beta_{21} y_{t-1} + \beta_{22} x_{t-1} + u_{2t} \tag{9.21}$$

with $u_{1t}$ and $u_{2t}$ mutually and serially independent. If $\alpha_2 = 0$, $x_t$ is
predetermined for $y_t$ in the first equation. On the other hand, $x_t$ is strictly
exogenous for $y_t$ only if $\alpha_2 = 0$ and $\beta_{21} = 0$. This is because if
$\beta_{21} \neq 0$, $x_t$ depends on $u_{1,t-1}$ through $y_{t-1}$. That is, in the first
equation $x_t$ is not independent of past errors.

In nondynamic models, and models with no serial correlation in the errors,
we do not have to make this distinction. For instance, in Section 9.9 we
considered an example of estimation of the Cobb-Douglas production function
under the hypothesis of maximization of expected profit. We saw that $y_2$ and
$y_3$ were independent of the error term $u_1$. Thus these variables are exogenous
for the estimation of the parameters in the production function. Similarly, in a
recursive system, the (nonnormalized) endogenous variables in each equation
can be treated as exogenous for the purpose of estimation of the parameters in
that equation.

Engle, Hendry, and Richard 17 are not satisfied with the foregoing definitions
of exogeneity and suggest three more concepts:

1. Weak exogeneity.

2. Strong exogeneity.

3. Superexogeneity.

14 This is the solution suggested in Maddala, Econometrics, Chap. 17, “Varying Parameter
Models.”

15 E. E. Leamer, “Vector Autoregressions for Causal Inference,” in K. Brunner and A. Meltzer
(eds.), Understanding Monetary Regimes (supplement to Journal of Monetary Economics),
1985, pp. 255-304.

16 This example and the discussion that follows is from R. L. Jacobs, E. E. Leamer, and M. P.
Ward, “Difficulties with Testing for Causation,” Economic Inquiry, 1979, pp. 401-413.

17 R. F. Engle, D. F. Hendry, and J. F. Richard, “Exogeneity,” Econometrica, Vol. 51, March
1983.

There is a proliferation of terms here, but since these concepts occur
frequently in the recent econometric literature and are not difficult to
understand, we will discuss them briefly. The concept of strong exogeneity is
linked to another concept: “Granger causality.”

One important point to note is that whether a variable is exogenous or not
depends on the parameter under consideration. Consider the equation 18

$$y = bx + u$$

where $b$ is the unknown parameter and $u$ is a variable that is unknown and
unnamed. This equation can also be written as

$$y = b^* x + v$$

where $b^* = b + 1$ and $v = u - x$. Again, $b^*$ is an unknown parameter and $v$
is an unknown variable. If $E(u|x) = 0$, it cannot be true that $E(v|x) = 0$,
because then $E(v|x) = -x$. So is $x$ exogenous or not? It clearly depends on
which parameter, $b$ or $b^*$, is of interest. This points to the importance of the
question: “Exogenous for what?”

As yet another example, consider the case where $y_t$ and $x_t$ have a bivariate
normal distribution with means $E(y_t) = \mu_1$, $E(x_t) = \mu_2$ and variances and
covariance given by $\text{var}(y_t) = \sigma_{11}$, $\text{var}(x_t) = \sigma_{22}$, and
$\text{cov}(y_t, x_t) = \sigma_{12}$. The conditional distribution of $y_t$ given $x_t$ is

$$y_t | x_t \sim IN(\alpha + \beta x_t, \sigma^2)$$

where $\beta = \sigma_{12}/\sigma_{22}$, $\alpha = \mu_1 - \beta\mu_2$, and
$\sigma^2 = \sigma_{11} - \sigma_{12}^2/\sigma_{22}$.

We can write the joint distribution of $y_t$ and $x_t$ as

$$f(y_t, x_t) = g(y_t | x_t)\, h(x_t)$$

and we can write the model as

$$y_t = \alpha + \beta x_t + u_{1t}, \qquad u_{1t} \sim IN(0, \sigma^2)$$

$$x_t = \mu_2 + v_{2t}, \qquad v_{2t} \sim IN(0, \sigma_{22}) \tag{9.22}$$

where $\text{cov}(x_t, u_{1t}) = 0$ and $\text{cov}(u_{1t}, v_{2t}) = 0$ by construction. If we
consider this set of equations, $x_t$ is “exogenous.”

On the other hand, we can similarly write

$$f(y_t, x_t) = h(x_t | y_t)\, g(y_t)$$

and write the model as

$$x_t = \gamma + \delta y_t + u_{2t}, \qquad u_{2t} \sim IN(0, \omega^2)$$

$$y_t = \mu_1 + v_{1t}, \qquad v_{1t} \sim IN(0, \sigma_{11}) \tag{9.23}$$

where $\text{cov}(y_t, u_{2t}) = \text{cov}(u_{2t}, v_{1t}) = 0$ by construction and
$\delta = \sigma_{12}/\sigma_{11}$, $\gamma = \mu_2 - \delta\mu_1$,
$\omega^2 = \sigma_{22} - \sigma_{12}^2/\sigma_{11}$. Now $y_t$ is “exogenous” in this model. So
which of $x_t$ and $y_t$ is exogenous, if at all? The answer depends on the
parameters of interest. If we are interested in the parameters $\alpha$, $\beta$,
$\sigma^2$, then $x_t$ is exogenous and equations (9.22) are the ones to estimate. If we
are interested in the parameters $\gamma$, $\delta$, $\omega^2$, then $y_t$ is exogenous and
equations (9.23) are the ones to estimate.

18 The example is from J. W. Pratt and R. Schlaifer, “On the Nature and Discovery of
Structure,” Journal of the American Statistical Association, Vol. 79, March 1984, pp. 9-21. It
is also discussed in Leamer, “Vector Autoregressions.”

The considerations above led to the following definition of weak exogeneity
by Engle, Hendry, and Richard:


Weak Exogeneity

A variable $x_t$ is said to be weakly exogenous for estimating a set of parameters
$\lambda$ if inference on $\lambda$ conditional on $x_t$ involves no loss of information.
That is, if we write

$$f(y_t, x_t) = g(y_t | x_t)\, h(x_t)$$

where $g(y_t | x_t)$ involves the parameters $\lambda$, weak exogeneity implies that the
marginal distribution $h(x_t)$ does not involve the parameters $\lambda$. Essentially,
the parameters in $h(x_t)$ are nuisance parameters.

In the example where $y_t$ and $x_t$ have a bivariate normal distribution, there are
five parameters: $\mu_1$, $\mu_2$, $\sigma_{11}$, $\sigma_{22}$, $\sigma_{12}$. These can be transformed
by a one-to-one transformation into $(\alpha, \beta, \sigma^2)$ and $(\mu_2, \sigma_{22})$. The
two sets are separate. Thus for the estimation of $(\alpha, \beta, \sigma^2)$ we do not
need information on $(\mu_2, \sigma_{22})$. Hence $x_t$ is weakly exogenous for the
estimation of $(\alpha, \beta, \sigma^2)$.
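
The one-to-one transformation in the bivariate normal example can be written out directly. This is a small sketch with made-up parameter values; the function names `conditional_params` and `joint_params` are illustrative only.

```python
def conditional_params(mu1, mu2, s11, s22, s12):
    """Map the bivariate normal parameters to (alpha, beta, sigma^2)."""
    beta = s12 / s22
    alpha = mu1 - beta * mu2
    sigma2 = s11 - s12**2 / s22
    return alpha, beta, sigma2

def joint_params(alpha, beta, sigma2, mu2, s22):
    """Inverse map: recover (mu1, mu2, s11, s22, s12) from the two parameter sets."""
    s12 = beta * s22
    mu1 = alpha + beta * mu2
    s11 = sigma2 + beta**2 * s22
    return mu1, mu2, s11, s22, s12

# Illustrative values (assumed, not from the text).
joint = (1.0, 2.0, 4.0, 1.0, 1.5)          # mu1, mu2, s11, s22, s12
cond = conditional_params(*joint)
print(cond)                                 # (alpha, beta, sigma^2)
print(joint_params(*cond, mu2=2.0, s22=1.0))

# Holding mu1, s11, s12 fixed while shifting mu2 changes alpha.
shifted = conditional_params(1.0, 3.0, 4.0, 1.0, 1.5)
print(shifted)
```

The round trip shows that $(\alpha, \beta, \sigma^2)$ together with $(\mu_2, \sigma_{22})$ carry the same information as the original five parameters. The last call shows that if $\mu_1$, $\sigma_{11}$, and $\sigma_{12}$ are held fixed while $\mu_2$ is moved, $\alpha$ changes as well.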


Superexogeneity 

The concept of superexogeneity is related to the Lucas critique. If $x_t$ is weakly
exogenous and the parameters in $g(y_t | x_t)$ remain invariant to changes in the
marginal distribution of $x_t$, then $x_t$ is said to be superexogenous. In the
example we have been considering, namely that of $y_t$ and $x_t$ having a bivariate
normal distribution, $x_t$ is weakly exogenous for the estimation of
$(\alpha, \beta, \sigma^2)$ in (9.22). But it is not superexogenous because if we change
$\mu_2$ and $\sigma_{22}$, the parameters in the marginal distribution of $x_t$, this will
produce changes in $(\alpha, \beta, \sigma^2)$. Note that weak exogeneity is a condition
required for efficient estimation. Superexogeneity is a condition required for
policy purposes.

Leamer finds it unnecessary to require weak exogeneity as a condition for
superexogeneity. He argues that this confounds the problem of efficient
estimation with that of policy analysis. His definition of exogeneity is the same
as the definition of superexogeneity by Engle, Hendry, and Richard without the
requirement of weak exogeneity.

Strong Exogeneity

If $x_t$ is weakly exogenous and $x_t$ is not preceded by any of the endogenous
variables in the system, $x_t$ is defined to be strongly exogenous. As an example,
consider the model

$$y_t = \beta x_t + u_{1t}$$

$$x_t = \alpha_1 x_{t-1} + \alpha_2 y_{t-1} + u_{2t}$$

$(u_{1t}, u_{2t})$ have a bivariate normal distribution and are serially independent;
$\text{var}(u_{1t}) = \sigma_{11}$, $\text{var}(u_{2t}) = \sigma_{22}$, and $\text{cov}(u_{1t}, u_{2t}) = \sigma_{12}$.
If $\sigma_{12} = 0$, then $x_t$ is weakly exogenous because the marginal distribution of
$x_t$ does not involve $\beta$ and $\sigma_{11}$. However, the second equation shows that
$y_t$ precedes $x_t$. ($x_t$ depends on $y_{t-1}$.) Hence $x_t$ is not strongly exogenous.

We have used the word “precedence” following Leamer, but the definition
by Engle, Hendry, and Richard is in terms of a concept called Granger
causality and is as follows. If $x_t$ is weakly exogenous and $x_t$ is not caused in
the sense of Granger by any of the endogenous variables in the system, then $x_t$
is defined to be strongly exogenous. Simply stated, the term “Granger causality”
means “precedence,” but we will discuss it in greater detail.

Granger Causality 

Granger starts from the premise that the future cannot cause the present or the
past. If event A occurs after event B, we know that A cannot cause B. At the
same time, if A occurs before B, it does not necessarily imply that A causes B.
For instance, the weatherman’s prediction occurs before the rain. This does
not mean that the weatherman causes the rain. In practice, we observe A and
B as time series and we would like to know whether A precedes B, or B
precedes A, or they are contemporaneous. For instance, do movements in prices
precede movements in interest rates, or is it the opposite, or are the movements
contemporaneous? This is the purpose of Granger causality. It is not causality
as it is usually understood.

Granger 19 devised some tests for causality (in the limited sense discussed
above) which proceed as follows. Consider two time series, $\{y_t\}$ and $\{x_t\}$. The
series $x_t$ fails to Granger cause $y_t$ if in a regression of $y_t$ on lagged $y$'s and
lagged $x$'s, the coefficients of the latter are zero. That is, consider

$$y_t = \sum_{i=1}^{k} \alpha_i y_{t-i} + \sum_{i=1}^{k} \beta_i x_{t-i} + u_t$$

Then if $\beta_i = 0$ $(i = 1, 2, \ldots, k)$, $x_t$ fails to cause $y_t$. The lag length $k$
is, to some extent, arbitrary.
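
The test just described amounts to an F-test comparing the regression with and without the lagged $x$'s. The sketch below is one way to compute the statistic with NumPy, under assumptions that are not in the text: a constant is included in both regressions, the lag length is set to $k = 2$, and the series are simulated so that $x$ precedes $y$ but not the reverse. The helper names `lagmat`, `rss`, and `granger_f` are hypothetical.

```python
import numpy as np

def lagmat(s, k):
    """Columns s_{t-1}, ..., s_{t-k}, aligned with observations t = k, ..., n-1."""
    n = len(s)
    return np.column_stack([s[k - i: n - i] for i in range(1, k + 1)])

def rss(y, X):
    """Residual sum of squares from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ coef
    return float(e @ e)

def granger_f(y, x, k):
    """F statistic for H0: the k lagged x's have zero coefficients in the y equation."""
    yt = y[k:]
    X_r = np.column_stack([np.ones(len(yt)), lagmat(y, k)])   # restricted: own lags only
    X_u = np.column_stack([X_r, lagmat(x, k)])                # unrestricted: add lagged x
    rss_r, rss_u = rss(yt, X_r), rss(yt, X_u)
    df = len(yt) - X_u.shape[1]
    return ((rss_r - rss_u) / k) / (rss_u / df)

# Simulated example (assumed design): x precedes y, y does not precede x.
rng = np.random.default_rng(1)
n = 500
ex, ey = rng.normal(size=n), rng.normal(size=n)
x, y = np.zeros(n), np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + ex[t]
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + ey[t]

f_x_to_y = granger_f(y, x, k=2)   # should be large
f_y_to_x = granger_f(x, y, k=2)   # should be small
print(f_x_to_y, f_y_to_x)
```

Each statistic would be compared with a critical value of the $F$ distribution with $k$ and $n - 2k - 1$ degrees of freedom, where $n$ here counts the usable observations.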

19 C. W. J. Granger, “Investigating Causal Relations by Econometric Models and Cross-Spectral
Methods,” Econometrica, Vol. 37, January 1969, pp. 24-36.

An alternative test provided by Sims 20 is as follows: $x_t$ fails to cause $y_t$ in the
Granger sense if in a regression of $y_t$ on lagged, current, and future $x$'s, the
latter coefficients are zero. Consider the regression

$$y_t = \sum_{j=-k_1}^{k_2} \beta_j x_{t-j} + u_t$$

Test $\beta_{-j} = 0$ $(j = 1, 2, \ldots, k_1)$. What this says is that the prediction of
$y$ from current and past $x$'s would not be improved if future values of $x$ are
included. There are some econometric differences between the two tests, but the
two tests basically test the same hypothesis. 21

As mentioned earlier, Leamer suggests using the simple word “precedence”
instead of the complicated term “Granger causality,” since all we are testing is
whether a certain variable precedes another and we are not testing causality as
it is usually understood. However, it is too late to complain about the term
since it has already been well established in the econometrics literature. Hence
it is important to understand what it means.

Granger Causality and Exogeneity 

As we defined earlier, Granger noncausality is necessary for strong exogeneity
as defined by Engle, Hendry, and Richard. Sims also regards tests for Granger
causality as tests for exogeneity. 22 However, Granger noncausality is neither
necessary nor sufficient for exogeneity as understood in the usual simultaneous
equations literature. 23 This point can be illustrated with the example in
equations (9.21). We said that $x_t$ is predetermined for $y_t$ in the first equation
if $\alpha_2 = 0$, and $x_t$ is strictly exogenous for $y_t$ if $\alpha_2 = 0$ and
$\beta_{21} = 0$.

Now to see what the Granger test does, write the reduced forms for $y_t$
and $x_t$:

$$y_t = \pi_{11} y_{t-1} + \pi_{12} x_{t-1} + v_{1t}$$

$$x_t = \pi_{21} y_{t-1} + \pi_{22} x_{t-1} + v_{2t}$$

For Granger noncausality, we have to have $\pi_{21} = 0$. But

$$\pi_{21} = \frac{\alpha_2 \beta_{11} + \beta_{21}}{1 - \alpha_1 \alpha_2}$$

Thus $\pi_{21} = 0$ implies that $\alpha_2 \beta_{11} + \beta_{21} = 0$. From this it does not
follow that $\alpha_2 = 0$. Thus Granger noncausality does not necessarily imply that
$x_t$ is predetermined. Conversely, $\alpha_2 = 0$ does not imply that $\pi_{21} = 0$.
However, $\alpha_2 = 0$ and $\beta_{21} = 0$ implies that $\pi_{21} = 0$, although the
converse is not true.

20 C. A. Sims, “Money, Income and Causality,” American Economic Review, Vol. 62, 1972, pp.
540-552.

21 G. Chamberlain, “The General Equivalence of Granger and Sims Causality,” Econometrica,
Vol. 50, 1982, pp. 569-582.

22 Sims, “Money,” p. 550.

23 T. F. Cooley and S. F. LeRoy, “A-theoretical Macroeconometrics,” Journal of Monetary
Economics, Vol. 16, No. 3, November 1985, pp. 283-308, see Sec. 5 on Granger causality and
Cowles causality.

Thus a test for Granger noncausality is not useful as a test for exogeneity. 
Some argue that it is, nevertheless, useful as a descriptive device for time- 
series data. 
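
The argument above can be checked numerically. The following is a minimal sketch: the function name `pi_21` and all parameter values are made up for illustration, and the formula is the reduced-form coefficient of $y_{t-1}$ in the $x_t$ equation given above.

```python
def pi_21(alpha1, alpha2, beta11, beta21):
    """Reduced-form coefficient of y_{t-1} in the x_t equation."""
    return (alpha2 * beta11 + beta21) / (1.0 - alpha1 * alpha2)

# Case A: x_t is NOT predetermined (alpha2 != 0), yet pi_21 = 0 because
# beta21 happens to equal -alpha2 * beta11 (illustrative values).
case_a = pi_21(alpha1=0.5, alpha2=0.4, beta11=0.5, beta21=-0.2)

# Case B: x_t IS predetermined (alpha2 = 0), yet pi_21 != 0 because beta21 != 0.
case_b = pi_21(alpha1=0.5, alpha2=0.0, beta11=0.5, beta21=0.3)

# Case C: x_t is strictly exogenous (alpha2 = 0 and beta21 = 0), so pi_21 = 0.
case_c = pi_21(alpha1=0.5, alpha2=0.0, beta11=0.5, beta21=0.0)

print("Case A (not predetermined):  pi_21 =", case_a)
print("Case B (predetermined):      pi_21 =", case_b)
print("Case C (strictly exogenous): pi_21 =", case_c)
```

Case A passes a Granger noncausality test although $x_t$ is not predetermined, and Case B fails it although $x_t$ is predetermined, which is the sense in which the test is neither necessary nor sufficient.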


Tests for Exogeneity 

The Cowles Foundation approach to simultaneous equations held the view that 
causality and exogeneity cannot be tested. These are things that have to be 
specified a priori. In recent years it has been argued that if some variables are 
specified as exogenous and the equation is identified, one can test whether 
some other variables considered endogenous are indeed endogenous or not. 

As an illustration, consider the following. We have a simultaneous equations
model with three endogenous variables $y_1$, $y_2$, and $y_3$, and three
exogenous variables $z_1$, $z_2$, and $z_3$. Suppose that the first equation of
the model is

$$y_1 = \beta_2 y_2 + \beta_3 y_3 + \alpha_1 z_1 + u_1$$

We want to test whether $y_2$ and $y_3$ can be treated as exogenous for the
estimation of this equation. To test this hypothesis, we obtain the predicted
values $\hat{y}_2$ and $\hat{y}_3$ of $y_2$ and $y_3$, respectively, from the
reduced form equations for $y_2$ and $y_3$. We then estimate the model

$$y_1 = \beta_2 y_2 + \beta_3 y_3 + \alpha_1 z_1 + \gamma_2 \hat{y}_2 + \gamma_3 \hat{y}_3 + u_1$$

by OLS and test the hypothesis $\gamma_2 = \gamma_3 = 0$ (using the F-test
described in Chapter 4). If the hypothesis is rejected, $y_2$ and $y_3$ cannot
be treated as exogenous. If it is not rejected, $y_2$ and $y_3$ can be treated
as exogenous.²⁴
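
The procedure just described can be sketched in a few lines of NumPy. This is an illustration on simulated data, not a reference implementation: the function names, the coefficient values, and the simulation design are all assumptions, and the variables are generated with zero means so that no intercept is needed.

```python
import numpy as np

def ols_rss(X, y):
    """Residual sum of squares from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def exogeneity_f_stat(y1, Y, Z_incl, Z_all):
    """F-statistic for gamma = 0 when the reduced-form predictions of Y
    are added to the regression of y1 on Y and the included exogenous Z."""
    # Predicted values of the suspect regressors from their reduced forms.
    Pi, *_ = np.linalg.lstsq(Z_all, Y, rcond=None)
    Y_hat = Z_all @ Pi
    restricted = np.column_stack([Y, Z_incl])
    unrestricted = np.column_stack([Y, Z_incl, Y_hat])
    rss_r = ols_rss(restricted, y1)
    rss_u = ols_rss(unrestricted, y1)
    q = Y.shape[1]                          # number of restrictions
    dof = len(y1) - unrestricted.shape[1]   # residual degrees of freedom
    return ((rss_r - rss_u) / q) / (rss_u / dof)

def simulate(endogenous, n=2000, seed=0):
    """Hypothetical data: y2, y3 depend on z1, z2, z3; u1 is correlated
    with the reduced-form errors only when `endogenous` is True."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 3))
    V = rng.normal(size=(n, 2))
    Y = Z @ np.array([[1.0, 0.5], [0.5, 1.0], [-0.7, 0.8]]) + V
    e = rng.normal(size=n)
    u1 = e + (V @ np.array([0.8, 0.5]) if endogenous else 0.0)
    y1 = 0.5 * Y[:, 0] - 0.3 * Y[:, 1] + 1.0 * Z[:, 0] + u1
    return y1, Y, Z[:, :1], Z

f_endog = exogeneity_f_stat(*simulate(endogenous=True))
f_exog = exogeneity_f_stat(*simulate(endogenous=False))
print(f"F when y2, y3 are endogenous: {f_endog:.2f}")
print(f"F when y2, y3 are exogenous:  {f_exog:.2f}")
```

The statistic would be compared with the critical value of the F-distribution with 2 and $n - 5$ degrees of freedom; in the simulated endogenous case it should be far above any conventional critical value.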


Summary 


1. In simultaneous equations models, each equation is normalized with respect
to one endogenous variable. Strictly speaking, since the endogenous variables
are all jointly determined, it should not matter which variable is chosen
for normalization. However, some commonly used methods of estimation (like
two-stage least squares) do depend on the normalization rule adopted. In
practice, normalization is determined by the way economic theories are
formulated.

24 The test procedure described here is Hausman’s test, discussed in greater detail in Section
12.10. It is equivalent to some other tests suggested in the literature. Two references are D.
Wu, “Alternative Tests of Independence Between Stochastic Regressors and Disturbances,”
Econometrica, Vol. 41, 1973, pp. 733-750, and N. S. Revankar, “Asymptotic Relative
Efficiency Analysis of Certain Tests of Independence in Structural Systems,” International
Economic Review, Vol. 19, 1978, pp. 165-179.

2. In a demand and supply model, if quantity supplied is not responsive to
price, the demand function should be normalized with respect to price. On the
other hand, if quantity supplied is highly responsive to price, the demand
function should be normalized with respect to quantity. (See Section 9.1 for a
graphical discussion and Section 9.3 for an empirical illustration.)

3. Before a simultaneous equations model is estimated, one should check
whether each equation is identified or not. In linear simultaneous equations
systems, a necessary condition for the identification of an equation is the order
condition, which says that the number of variables missing from the equation
should be greater than or equal to the number of endogenous variables in the
equation minus one. This counting rule is only a necessary condition. One also
has to check the rank condition, which is based on the structure of the missing
variables in the other equations. This is illustrated with some examples in
Section 9.4.

4. It is customary to classify an equation into the over-identified, exactly
identified, and under-identified categories according as the number of variables
missing from the equation is, respectively, greater than, equal to, or less than
the number of endogenous variables minus one. It is not possible to get
consistent estimates of the parameters in the under-identified equations. The
difference between over-identified and exactly identified equations is simply that
the latter are easier to estimate than the former.

5. We have discussed only single-equation methods, that is, estimation of 
each equation at a time. The methods we have discussed are: 

(a) The instrumental variable (IV) methods (Section 9.5). 

(b) The two-stage least squares (2SLS) method (Section 9.6). 

(c) The limited-information maximum likelihood (LIML) method (Section 9.8). 
For an exactly identified equation, all the methods are equivalent and give the 
same answers. For an over-identified equation the IV method gives different 
estimates depending on which of the missing exogenous variables are chosen 
as instruments. The 2SLS method is a weighted instrumental variable method. 

6. In over-identified equations, the 2SLS estimates depend on the
normalization rule adopted. The LIML estimates do not depend on the choice of
normalization. The LIML method is thus a truly simultaneous estimation method,
but the 2SLS is not because, strictly speaking, since the endogenous variables
are jointly determined, normalization should not matter.

7. It is not always true that we can say nothing about the parameters of an
under-identified equation. In some cases the ordinary least squares (OLS)
estimates, even if they are not consistent, do give us some information about the
parameters. Some examples are given in Section 9.9. It is also interesting to
note that the concept of identification as discussed by Working in 1927 is quite
different from the current discussion, which is concentrated on getting
consistent estimators for the parameters (see Section 9.9).

8. There are some cases where the simultaneous equations model can be
estimated by using the OLS method. One such example is the recursive model.
We also give another example, that of estimation of the Cobb-Douglas
production function under uncertainty, where the OLS method is the appropriate
one (see Section 9.9).

9. In recent years, the usual definition of exogeneity has been questioned.
Some new terms have been introduced, such as weak exogeneity, strong
exogeneity, and super exogeneity. One important question that has been raised is:
“Exogenous for what?” If it is for efficient estimation of the parameters (this
is the concept of weak exogeneity), a variable can be treated as endogenous in
one equation and exogenous in another (as in a recursive system). Also,
whether a variable is exogenous or not depends on the parameter to be
estimated (see Pratt’s example).

10. Strong exogeneity is weak exogeneity plus Granger noncausality. The term
“Granger causality” has nothing to do with causality as it is usually
understood. A better term for it is “precedence.” Some econometricians have
equated the concept of exogeneity with Granger noncausality. The example in
Section 9.10 shows that linking Granger causality to exogeneity has some pitfalls.
It is better to keep the two concepts separate.

11. A variable is considered superexogenous if interventions in that variable 
leave the parameters in the system unaltered. It is a concept that has to do with 
Lucas’s critique of econometric policy evaluation. As a definition, the concept 
is all right. But from the practical point of view its use is questionable. 

12. There have been some tests of exogeneity suggested in the context of 
simultaneous equations systems. These tests depend on the availability of extra 
instrumental variables. The tests are easy to apply because they depend on the 
addition of some constructed variables to the usual models, and testing that 
the coefficients of these added variables are zero. 


Exercises 


1. Explain the meaning of each of the following terms. 

(a) Endogenous variables. 

(b) Exogenous variables. 

(c) Structural equations. 

(d) Reduced-form equations. 

(e) Order condition for identification. 

(f) Rank condition for identification. 

(g) Indirect least squares. 

(h) Two-stage least squares. 

(i) Instrumental variable methods. 

(j) Normalization. 

(k) Simultaneity bias. 

(l) Recursive systems. 

2. Explain concisely what is meant by “the identification problem” in the
context of the linear simultaneous equations model.

3. Consider the three-equation model

$$y_1 = \beta_{13} y_3 + \gamma_{12} x_2 + u_1$$

$$y_2 = \beta_{21} y_1 + \beta_{23} y_3 + \gamma_{21} x_1 + \gamma_{22} x_2 + u_2$$

$$y_3 = \gamma_{33} x_3 + u_3$$

where $y_1$, $y_2$, and $y_3$ are endogenous, and $x_1$, $x_2$, and $x_3$ are
exogenous. Discuss the identification of each of the equations of the model,
based on the order and rank conditions.

Now suppose that you want to estimate the first equation by two-stage
least squares, but you have only an ordinary least squares program
available. Explain carefully, step by step, how you would estimate
$\beta_{13}$, $\gamma_{12}$, and $\text{var}(u_1)$.

4. Consider the model

$$y_1 = \alpha y_2 + \delta x + u_1$$

$$y_2 = \beta y_1 + \gamma x + u_2$$

where $x$ is exogenous, and the error terms $u_1$ and $u_2$ have mean zero and
are serially uncorrelated.

(a) Write down the equations expressing the reduced-form coefficients
in terms of the structural parameters.

(b) Show that if $\gamma = 0$, then $\beta$ can be identified. Are the
parameters $\alpha$ and $\delta$ identified in this case? Why or why not?

(c) In the case of $\gamma = 0$, what formula would you use to estimate
$\beta$? What is the asymptotic variance of your estimator of $\beta$?

5. What is meant by the phrase: “The estimator is invariant to
normalization”? Do any problems arise if an estimator is not invariant to
normalization? Which of the following estimation methods gives estimators that
are invariant to normalization?

(a) Indirect least squares. 

(b) 2SLS. 

(c) Instrumental variable methods. 

(d) LIML 

Explain how you would choose the appropriate normalization (with respect 
to quantity or price) in a demand and supply model. 

6. The structure of a model with four endogenous and three exogenous
variables is as follows (1 indicates presence and 0 absence of the variable in
the equation):

Equation   y1   y2   y3   y4   x1   x2   x3
   1        1    0    1    1    1    0    0
   2        1    1    1    0    0    1    1
   3        0    0    1    0    1    0    0
   4        1    0    1    1    0    1    0

Which of the four equations are identified?



7. Explain how you would compute $R^2$ in simultaneous equations estimation
methods.

8. Examine whether each of the following statements is true (T), false (F), or
uncertain (U), and give a short explanation.

(a) In a simultaneous equation system, the more the number of
exogenous variables the better.

(b) If the multiple correlations of the reduced-form equations are nearly
1, the OLS and 2SLS estimates of the parameters will be very close
to each other.

(c) In the 2SLS method we should replace only the endogenous
variables on the right-hand side of the equation by their fitted values
from the reduced form. We should not replace the endogenous
variable on the left-hand side by its fitted value.

(d) Which variables should be treated as exogenous and which as
endogenous cannot be determined from the data.

(e) An estimation of the demand function for steel gave the price
elasticity of steel as +0.3. This finding should be interpreted to mean
that the price elasticity of supply is at least +0.3.

(f) Any equation can be made identified by deleting enough exogenous
variables from the equation or adding enough exogenous variables
to the other equations.

(g) Any variable can be endogenous in one equation and exogenous in
another equation.

(h) Some simultaneous equation systems can be estimated by ordinary
least squares.

(i) If the $R^2$ from the 2SLS method is negative or very low and the $R^2$
from the OLS method is high, it should be concluded that something
is wrong with the specification of the model or the identification of
that particular equation.

(j) The $R^2$ from the OLS method will always be higher than the $R^2$ from
the 2SLS method, but this does not mean that the OLS method is
better.

(k) In exactly identified equations, the choice of which variable to
normalize does not matter.

(l) In exactly identified equations, we can normalize the equation with
respect to an exogenous variable as well.

9. Consider the model

$$Q = \beta_0 + \beta_1 P_w + \beta_2 Y + u \qquad \text{demand function}$$

$$Q = \alpha_0 + \alpha_1 P_w + \alpha_2 S + v \qquad \text{supply function}$$

Estimate this model by 2SLS, instrumental variable, and indirect least
squares methods using the data in Table 9.2 (transform all variables to
logs). Would you get different results using the three methods?

How would you choose the appropriate normalization (with respect to
$Q$ or $P_w$)? Does normalization matter in this model?

10. Estimate the supply equation in the model of the Australian wine industry
in Section 9.5 if it is normalized with respect to $P_w$.

Appendix to Chapter 9

Necessary and Sufficient Conditions for Identification 
(Section 9.4) 

Consider a simultaneous equations model with $g$ endogenous variables and $k$
exogenous variables. In matrix notation we can write the model as

$$B y_t + \Gamma x_t = u_t, \qquad t = 1, 2, \ldots, T \tag{9A.1}$$

where $y_t$ = a $g \times 1$ vector of observations on the endogenous variables

$x_t$ = a $k \times 1$ vector of observations on the exogenous variables

$u_t$ = a $g \times 1$ vector of errors

$B$ = a $g \times g$ matrix of coefficients of the endogenous variables

$\Gamma$ = a $g \times k$ matrix of coefficients of the exogenous variables

In the example given in Section 9.4, $g = 7$ and $k = 3$. $B$ is the $7 \times 7$
matrix consisting of the first seven columns, and $\Gamma$ is the $7 \times 3$
matrix consisting of the last three columns.

We assume that the matrix $B$ is nonsingular. Hence we can solve (9A.1) for
$y_t$ to get

$$y_t = -B^{-1}\Gamma x_t + B^{-1} u_t = \Pi x_t + v_t \tag{9A.2}$$

This equation is called the reduced form. Equation (9A.1) is called the
structural form. From (9A.2) we have

$$-B^{-1}\Gamma = \Pi \quad \text{or} \quad B\Pi + \Gamma = 0 \quad \text{and} \quad v_t = B^{-1} u_t \tag{9A.3}$$

We assume that the errors $u_t$ have zero mean, are independent, and have a
common covariance matrix $E(u_t u_t') = \Sigma$.

To discuss identification, without any loss of generality, consider the first
equation in (9A.1). Let $\beta'$ be the first row of $B$ and $\gamma'$ the first
row of $\Gamma$. Partition these vectors each into two components corresponding
to the included and excluded variables in this equation. We have

$$\beta' = [\beta_1' \;\; \beta_2'] \qquad \gamma' = [\gamma_1' \;\; \gamma_2']$$

$\beta_1'$ corresponds to $g_1$ included and $\beta_2'$ corresponds to $g_2$
excluded endogenous variables ($g_1 + g_2 = g$). Similarly, $\gamma_1'$
corresponds to $k_1$ included and $\gamma_2'$ to $k_2$ excluded exogenous
variables ($k_1 + k_2 = k$). In the first equation in the example in Section
9.4, we have $g_1 = 3$, $g_2 = 4$, $k_1 = 1$, and $k_2 = 2$.

Now partition the matrices $B$ and $\Gamma$ also conformably to the partitioning
of $\beta$ and $\gamma$. Since the coefficients of the excluded variables are
zero in the first equation, we have

$$B = \begin{bmatrix} \beta_1' & 0' \\ B_1 & B_2 \end{bmatrix} \qquad \Gamma = \begin{bmatrix} \gamma_1' & 0' \\ \Gamma_1 & \Gamma_2 \end{bmatrix}$$

Consider the matrix

$$D = [B_2 \;\; \Gamma_2]$$

$D$ is the matrix corresponding to the missing endogenous and exogenous
variables. The necessary and sufficient condition for identification, also
known as the rank condition, is

$$\text{Rank}(D) = g - 1$$

The proof of this proposition follows from noting that if rank
$[B_2 \;\; \Gamma_2] < g - 1$, there will exist a nonnull vector $a$ such that
$a'[B_2 \;\; \Gamma_2] = [0 \;\; 0]$. In this case we can find a linear
combination of the other $(g - 1)$ equations, with coefficients given by
the elements of $a$, which, when added to the first equation, results in an
equation that “looks like” it. Thus it is not possible to identify the
parameters of the first equation.
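
The order and rank conditions lend themselves to a mechanical check. The sketch below is illustrative only: the function name `check_identification` and the numerical coefficients are hypothetical, and the rank is evaluated at one particular set of coefficient values rather than generically. The coefficient matrix `A` stacks $[B \;\; \Gamma]$ for a three-equation system with three exogenous variables.

```python
import numpy as np

def check_identification(A, g, eq):
    """Return (rank condition holds, order condition holds) for equation `eq`.

    A is the g x (g + k) matrix [B Gamma]; a zero entry means the variable
    is excluded from that equation.
    """
    excluded = np.flatnonzero(np.isclose(A[eq], 0.0))   # missing variables
    others = np.delete(np.arange(g), eq)                # remaining g - 1 equations
    D = A[np.ix_(others, excluded)]                     # D = [B_2 Gamma_2]
    rank = int(np.linalg.matrix_rank(D)) if D.size else 0
    return rank == g - 1, len(excluded) >= g - 1

# Columns: y1, y2, y3, x1, x2, x3 (made-up coefficient values).
A = np.array([
    [ 1.0, 0.0, -0.5,  0.0, -1.2,  0.0],
    [-0.3, 1.0, -0.7, -0.4,  0.9,  0.0],
    [ 0.0, 0.0,  1.0,  0.0,  0.0, -2.0],
])
g = 3
results = [check_identification(A, g, eq) for eq in range(g)]
for eq, (rank_ok, order_ok) in enumerate(results, start=1):
    print(f"Equation {eq}: rank condition {rank_ok}, order condition {order_ok}")
```

In this hypothetical system the second equation excludes only one variable, so it fails the order condition and, necessarily, the rank condition as well.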

Methods of Estimation (Sections 9.4 and 9.5) 

We shall consider the estimation of a single equation by least squares methods.
Let the particular equation consisting of $g_1$ endogenous variables and $k_1$
exogenous variables be written as

$$y = Y_1\beta + X_1\gamma + u = Z_1\delta + u \tag{9A.4}$$

where $y$ = $T \times 1$ vector of observations on the endogenous variable
chosen for normalization (i.e., to have coefficient 1)

$Y_1$ = $T \times (g_1 - 1)$ matrix of observations on the included
endogenous variables

$X_1$ = $T \times k_1$ matrix of observations on the included
exogenous variables

$\delta' = [\beta' \;\; \gamma']$ is the vector of parameters to be estimated

$Z_1 = [Y_1 \;\; X_1]$

$u$ = $T \times 1$ vector of errors

We assume that $E(uu') = \sigma^2 I_T$.

Let $X$ be the $T \times k$ matrix of observations on all the exogenous variables
in the system. Equation (9A.4) is identified by the order condition only if the
number of excluded variables is $\geq g - 1$; that is, $g_2 + k_2 \geq (g - 1)$.
We shall assume that this is satisfied.

Since $Y_1$ and $u$ are correlated, the OLS estimators of the parameters in
(9A.4) are not consistent. To get consistent estimates, we use instrumental
variables for $Y_1$. Let us consider $\hat{Y}_1$, where $\hat{Y}_1$ is the
predicted value of $Y_1$ from the reduced-form equations. Then

$$\hat{Y}_1 = X(X'X)^{-1}X'Y_1 \tag{9A.5}$$

Also let $\hat{V}_1$ be the estimated residuals from the reduced form, so that
$Y_1 = \hat{Y}_1 + \hat{V}_1$. Then we have $X'\hat{V}_1 = 0$ (the residuals
from an OLS estimation are uncorrelated with the regressors). Hence we get

$$X'\hat{Y}_1 = X'Y_1 \quad \text{and} \quad \hat{Y}_1'\hat{V}_1 = 0 \tag{9A.6}$$

We assume that the exogenous variables are independent of the errors so that
$\text{plim}\left(\frac{1}{T}X'u\right) = 0$ and also that
$\text{plim}\left(\frac{1}{T}X'X\right)$ is a positive definite matrix (i.e.,
there are no linear dependencies among the x’s and no degeneracies).

These assumptions now enable us to prove that the instrumental variable
(IV) estimator is consistent and also enable us to derive its asymptotic
distribution.

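Define $\hat{Z}_1 = [\hat{Y}_1 \;\; X_1]$.

Then using the relations (9A.6) we can check that
$\hat{Z}_1'Z_1 = \hat{Z}_1'\hat{Z}_1$. The IV estimator of $\delta$ in (9A.4) is

$$\hat{\delta}_{IV} = (\hat{Z}_1'Z_1)^{-1}\hat{Z}_1'y = (\hat{Z}_1'Z_1)^{-1}\hat{Z}_1'(Z_1\delta + u) = \delta + (\hat{Z}_1'Z_1)^{-1}\hat{Z}_1'u \tag{9A.7}$$

$$\text{plim}\,\hat{\delta}_{IV} = \delta + \text{plim}\left(\tfrac{1}{T}\hat{Z}_1'Z_1\right)^{-1} \cdot \text{plim}\left(\tfrac{1}{T}\hat{Z}_1'u\right) = \delta$$

since $\text{plim}\left(\frac{1}{T}\hat{Z}_1'u\right) = 0$ and
$\text{plim}\left(\frac{1}{T}\hat{Z}_1'Z_1\right)$ is finite. These relations
follow from the assumptions that $\text{plim}\left(\frac{1}{T}X'u\right) = 0$
and $\text{plim}\left(\frac{1}{T}X'X\right)$ is a positive definite matrix.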

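Thus, as is expected, the IV estimator is consistent. The asymptotic
covariance matrix of $\sqrt{T}\,\hat{\delta}_{IV}$ is given by

$$AE\left[T(\hat{\delta}_{IV} - \delta)(\hat{\delta}_{IV} - \delta)'\right]$$

where AE denotes asymptotic expectation. It is customary to assume that we
can substitute plim for AE. This gives the asymptotic covariance matrix as

$$\text{plim}\; T(\hat{\delta}_{IV} - \delta)(\hat{\delta}_{IV} - \delta)'$$

and using (9A.7) we get this

$$= \text{plim}\left(\tfrac{1}{T}\hat{Z}_1'Z_1\right)^{-1} \cdot \text{plim}\left(\tfrac{1}{T}\hat{Z}_1'uu'\hat{Z}_1\right) \cdot \text{plim}\left(\tfrac{1}{T}Z_1'\hat{Z}_1\right)^{-1} = \sigma^2\, \text{plim}\left(\tfrac{1}{T}\hat{Z}_1'\hat{Z}_1\right)^{-1} \tag{9A.8}$$

since $E(uu') = \sigma^2 I_T$ and $\hat{Z}_1'Z_1 = \hat{Z}_1'\hat{Z}_1$. In
practice we estimate $\sigma^2$ by

$$\hat{\sigma}^2 = \frac{(y - Y_1\hat{\beta} - X_1\hat{\gamma})'(y - Y_1\hat{\beta} - X_1\hat{\gamma})}{T - g_1 - k_1} \tag{9A.9}$$

Also, note that $\hat{Y}_1'\hat{Y}_1 = Y_1'MMY_1 = Y_1'MY_1$, where
$M = X(X'X)^{-1}X'$ is an idempotent matrix, and $\hat{Y}_1'X_1 = Y_1'X_1$.
Also, in practice we estimate
$\text{plim}\left(\frac{1}{T}\hat{Z}_1'\hat{Z}_1\right)$ by

$$\frac{1}{T}\hat{Z}_1'\hat{Z}_1 = \frac{1}{T}\begin{bmatrix} Y_1'MY_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}$$

In practice, we consider the variance of $\hat{\delta}_{IV}$ (not
$\sqrt{T}\,\hat{\delta}_{IV}$) and we write

$$\text{var}(\hat{\delta}_{IV}) = \sigma^2 \begin{bmatrix} Y_1'MY_1 & Y_1'X_1 \\ X_1'Y_1 & X_1'X_1 \end{bmatrix}^{-1}$$

and estimate $\sigma^2$ by $\hat{\sigma}^2$ given by (9A.9).

In the 2SLS estimation method, we use $\hat{Y}_1$ as regressors rather than
instrumental variables; that is, we substitute $\hat{Y}_1$ for $Y_1$ on the
right-hand side of (9A.4) and estimate the equation by OLS. The equation is

$$y = \hat{Y}_1\beta + X_1\gamma + (u + \hat{V}_1\beta) = \hat{Z}_1\delta + (u + \hat{V}_1\beta)$$

$$\hat{\delta}_{2SLS} = (\hat{Z}_1'\hat{Z}_1)^{-1}\hat{Z}_1'y = (\hat{Z}_1'\hat{Z}_1)^{-1}\hat{Z}_1'(\hat{Z}_1\delta + u + \hat{V}_1\beta) = \delta + (\hat{Z}_1'\hat{Z}_1)^{-1}\hat{Z}_1'u$$

since $\hat{Z}_1'\hat{V}_1 = 0$. Since $\hat{Z}_1'Z_1 = \hat{Z}_1'\hat{Z}_1$,
it follows from (9A.7) that

$$\hat{\delta}_{IV} = \hat{\delta}_{2SLS}$$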

That is, it does not make any difference whether Y, is used as a regressor or as 
an instrument. This shows that the 2SLS estimator is also an IV estimator. 
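
The equality of the two estimators is easy to confirm numerically. The following sketch uses simulated data with one included endogenous regressor and one included exogenous variable; the data-generating values and variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400
X = rng.normal(size=(T, 3))             # all exogenous variables in the system
X1 = X[:, :1]                           # included exogenous variable (k1 = 1)
v = rng.normal(size=T)                  # reduced-form error of Y1
u = 0.7 * v + rng.normal(size=T)        # structural error, correlated with Y1
Y1 = (X @ np.array([1.0, 0.5, -0.8]) + v).reshape(-1, 1)
y = 0.5 * Y1[:, 0] + 1.0 * X1[:, 0] + u

# Predicted values from the reduced form: Y1_hat = X (X'X)^{-1} X' Y1.
Y1_hat = X @ np.linalg.solve(X.T @ X, X.T @ Y1)
Z1 = np.hstack([Y1, X1])
Z1_hat = np.hstack([Y1_hat, X1])

# IV: Y1_hat used as an instrument.  2SLS: Y1_hat used as a regressor.
delta_iv = np.linalg.solve(Z1_hat.T @ Z1, Z1_hat.T @ y)
delta_2sls = np.linalg.solve(Z1_hat.T @ Z1_hat, Z1_hat.T @ y)

# Error variance uses the ACTUAL Y1 in the residuals, as in (9A.9).
g1, k1 = 1 + Y1.shape[1], X1.shape[1]
resid = y - Z1 @ delta_iv
sigma2_hat = float(resid @ resid) / (T - g1 - k1)
cov_hat = sigma2_hat * np.linalg.inv(Z1_hat.T @ Z1_hat)

print("IV estimate:  ", delta_iv)
print("2SLS estimate:", delta_2sls)
print("Std. errors:  ", np.sqrt(np.diag(cov_hat)))
```

One practical point the sketch makes explicit: if the second stage is run with an ordinary OLS program, the residual variance it reports is based on $\hat{Y}_1$ rather than $Y_1$ and has to be recomputed as above.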

We can also show that the 2SLS estimator is the best IV estimator. We shall,
however, omit the proof here.²⁵


25 For a proof, see Maddala, Econometrics, p. 477.
10.1 Introduction


Expectations play a crucial role in almost every economic activity. Production
depends on expected sales, investment depends on expected profits, long-term
interest rates depend on expected short-term rates and expected inflation rates,
and so on. It is, therefore, important to study models of expectations and how
these models are estimated.

In the following sections we study three different models of expectation for¬ 
mation: 

1. Naive models of expectation. 

2. Adaptive models of expectation. 

3. Rational expectations models. 

In each case we concentrate on the econometric problems involved. 

Since expectations play an important role in economic activity, there are
many surveys conducted by different organizations to find out what
consumers’ expectations are regarding different economic variables. For instance,
the Survey Research Center at the University of Michigan conducts surveys
regarding consumers’ attitudes toward purchases of different durable goods
and their forecasts about future inflation rates. There are now survey data
available on forecasts of a number of economic variables: wages, interest rates,
exchange rates, and so on. An important question that arises is how we
can make use of these survey data. Many econometricians have also
investigated the question of how well these survey data forecast the relevant
economic variables. After discussing the three models of expectations mentioned
above, we discuss the usefulness of survey data on expectations.


10.2 Naive Models of Expectations 


The earliest models of expectations involved using past values of the relevant
variables, or simple extrapolations of the past values, as measures of the
expected variables. Consider, for instance, an investment equation

$$y_t = a + b x^*_{t+1} + u_t \tag{10.1}$$

where $y_t$ = investment in period $t$

$x^*_{t+1}$ = expected profits during period $t + 1$

$u_t$ = error term

Unless otherwise noted, expectations are formed in the previous time
period. Thus $x^*_t$ denotes expectations of profits for period $t$ as formed at
period $t - 1$. Let $x_t$ be the actual profits for period $t$. Then a naive
model for $x^*_{t+1}$ is

$$x^*_{t+1} = x_t \tag{10.2}$$

That is, the firm believes that the profits next period will be the same as profits
this period. A simple extrapolative model would be to say that profits next
period will increase by the same amount as the latest increase. This gives

$$x^*_{t+1} - x_t = x_t - x_{t-1}$$

or

$$x^*_{t+1} = 2x_t - x_{t-1} \tag{10.3}$$

Another extrapolative model would be to say that profits will increase by the
same percentage as the latest rate of increase. This gives

$$\frac{x^*_{t+1}}{x_t} = \frac{x_t}{x_{t-1}}$$

or

$$x^*_{t+1} = \frac{x_t^2}{x_{t-1}} \tag{10.4}$$


In all these cases we estimate equation (10.1) after substituting the relevant
formula for $x^*_{t+1}$ from (10.2), (10.3), or (10.4). Since the formula for
$x^*_{t+1}$ is derived from outside and does not consider the equation (10.1)
to be estimated, these expectations are considered exogenous (derived from
outside the economic model under consideration). We will, in the following
sections, discuss cases where expectations are endogenous (i.e., derived by
taking account of the economic model we are considering).

The previous formulas for $x^*_{t+1}$ need to be changed suitably if we have
quarterly data or monthly data. In these cases there are quarterly or monthly
fluctuations, called seasonal fluctuations. For instance, December sales this
year would be comparable to December sales last year because of the Christmas
season. Hence formula (10.4) would be written as

$$\frac{x^*_{t+1}}{x_{t-3}} = \frac{x_t}{x_{t-4}} \quad \text{for quarterly data}$$

$$\frac{x^*_{t+1}}{x_{t-11}} = \frac{x_t}{x_{t-12}} \quad \text{for monthly data} \tag{10.5}$$

where, again, $x^*_t$ denotes expected profits and $x_t$ actual profits. Note
that we compare the corresponding quarters or months and take the most recent
percentage gain as the benchmark.

Formulas like (10.5) were used by Ferber¹ to check the predictive accuracy
of railroad shippers’ forecasts. He compared the shippers’ forecasts with
actual values, and also the forecasts given by formula (10.5) with the
actual values, and found that the railroad shippers’ forecasts were worse than
those given by the naive formula. For the comparison he used the average
absolute error (AAE), given by

$$\text{AAE} = \frac{1}{n}\sum \left|\text{actual} - \text{predicted}\right|$$
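
The naive formulas and the AAE criterion are simple enough to state as code. This is a small sketch with made-up profit figures; the function names are hypothetical, and the `season` argument generalizes the percentage formula to quarterly (`season=4`) or monthly (`season=12`) comparisons.

```python
def naive_forecast(x):
    """Next period equals this period."""
    return x[-1]

def difference_forecast(x):
    """Next period rises by the same amount as the latest increase."""
    return 2 * x[-1] - x[-2]

def ratio_forecast(x, season=1):
    """Next period rises by the same percentage as the latest comparable gain.

    With season=1 the comparison is with the previous period; with season=4
    (or 12) the same quarter (or month) of the previous year is used.
    """
    return x[-season] * x[-1] / x[-1 - season]

def average_absolute_error(actual, predicted):
    """AAE = (1/n) * sum |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

profits = [100.0, 110.0, 121.0]          # illustrative annual profits
f_naive = naive_forecast(profits)
f_diff = difference_forecast(profits)
f_ratio = ratio_forecast(profits)
aae = average_absolute_error([1.0, 2.0, 3.0], [1.0, 3.0, 5.0])
print(f_naive, f_diff, f_ratio, aae)
```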


1 Robert Ferber, The Railroad Shipper’s Forecasts (Urbana, Ill.: Bureau of Economic Research,
University of Illinois, 1953).

Hirsch and Lovell² did a similar study based on the Manufacturers’ Inventory
and Sales Expectations Survey of the Office of Business Economics, U.S.
Department of Commerce. However, they found that the anticipations data were
more accurate than predictions from naive models.

Yet another naive formula that is often used, and used by Ferber, is that of
regressive expectations. In this formula there are two components:

1. A growth component based on recent growth rates as in (10.5).

2. A return-to-normal component, also called the regressivity component.

Formula (10.5) would now be written as 



$\alpha$ is the “reversal coefficient.” Ferber estimated an equation like (10.6)
using the railroad shippers’ forecasts for $x^*_t$ and the actual values for
$x_t$. He found estimates of $\beta = 0.986$ and $\alpha = 0.556$ and concluded
that expectations were regressive. Hirsch and Lovell found that the data they
considered also showed regressivity in expectations, but they argue that this
is because the actual data are also regressive.

The naive models that we have considered are by no means recommended. 
However, they are often used as benchmarks by which we judge any survey 
data on expectations. 


10.3 The Adaptive Expectations Model 


The models considered in Section 10.2 use only a few of the past values in
forming expectations. Some other models use the entire past history, with the
past values receiving declining weights as we go farther into the distant past.
These models are called distributed lag models of expectations. Consider

$$x^*_{t+1} = \beta_0 x_t + \beta_1 x_{t-1} + \cdots + \beta_k x_{t-k} \tag{10.7}$$

This is called a finite distributed lag since the number of lagged (past) values is
finite. $\beta_0, \beta_1, \ldots, \beta_k$ are the weights that we give to these
past values. The naive model (10.2) corresponds to $\beta_0 = 1$ and
$\beta_1 = \beta_2 = \cdots = \beta_k = 0$. Distributed lags like (10.7) have a
long history. Irving Fisher³ was perhaps the first one to use them. He
suggested arithmetically declining weights

$$\beta_i = \begin{cases} (k + 1 - i)\beta & \text{for } 0 \leq i \leq k \\ 0 & \text{for } i > k \end{cases}$$

2 A. A. Hirsch and M. C. Lovell, Sales Anticipations and Inventory Behavior (New York: Wiley,
1969).

3 I. Fisher, “Our Unstable Dollar and the So-Called Business Cycle,” Journal of the American
Statistical Association, Vol. 20, 1925; also I. Fisher, “Note on a Short-Cut Method for
Calculating Distributed Lags,” Bulletin of the International Statistical Institute, Vol. 29, 1937, pp.
323-327.

Thus the weights are $(k + 1)\beta$, $k\beta$, $(k - 1)\beta$, and so on. The
sum of the weights is

$$\frac{(k + 1)(k + 2)}{2}\beta$$

We might want to restrict this sum to 1. The only problem with this is that if
there is a trend in $x_t$, say $x_t$ is increasing over time, then $x^*_{t+1}$
given by (10.7) will continuously underpredict the actual values. We can make
an adjustment for this by multiplying $x^*_{t+1}$ by $(1 + g)$, where $g$ is the
average growth rate of $x_t$. Thus in using distributed lag models we make
adjustments for the growth rate observed in the past [which is actually the
idea in formulas like (10.4)].

The distributed lag models received greater attention in the 1950s when
Koyck,⁴ Cagan,⁵ and Nerlove⁶ suggested using an infinite lag distribution with
geometrically declining weights. Equation (10.7) will now be written as

$$x^*_{t+1} = \sum_{i=0}^{\infty} \beta_i x_{t-i} \tag{10.8}$$

If the $\beta_i$ are geometrically decreasing we can write

$$\beta_i = \beta_0 \lambda^i \qquad 0 < \lambda < 1$$

The sum of the infinite series is $\beta_0/(1 - \lambda)$, and if this sum is
equal to 1 we should have $\beta_0 = 1 - \lambda$. Thus we get

$$x^*_{t+1} = \sum_{i=0}^{\infty} (1 - \lambda)\lambda^i x_{t-i} \tag{10.9}$$

Figure 10.1 shows the graph of successive values of $\beta_i$. There is one
interesting property of this relationship. Lag equation (10.9) by one time
period and multiply by $\lambda$. We get

$$\lambda x^*_t = \lambda \sum_{i=0}^{\infty} (1 - \lambda)\lambda^i x_{t-i-1} = \sum_{i=0}^{\infty} (1 - \lambda)\lambda^{i+1} x_{t-i-1}$$

Substituting $j = i + 1$, we get

$$\lambda x^*_t = \sum_{j=1}^{\infty} (1 - \lambda)\lambda^j x_{t-j} \tag{10.10}$$

Subtracting (10.10) from (10.9) we are left with only the first term on the
right-hand side of (10.9). We thus get
4 L. M. Koyck, Distributed Lags and Investment Analysis (Amsterdam: North-Holland, 1954).
A thorough discussion of the Koyck model can be found in M. Nerlove, Distributed Lags and
Demand Analysis, U.S.D.A. Handbook 141 (Washington, D.C.: U.S. Government Printing
Office, 1958).

5 Phillip D. Cagan, “The Monetary Dynamics of Hyperinflations,” in M. Friedman (ed.), Studies
in the Quantity Theory of Money (Chicago: University of Chicago Press, 1956), pp. 25-117.

6 Marc Nerlove, The Dynamics of Supply: Estimation of Farmers’ Response to Price (Baltimore:
The Johns Hopkins Press, 1958).

Figure 10.1. Geometric or Koyck Lag.

$$x^*_{t+1} - \lambda x^*_t = (1 - \lambda)x_t \tag{10.11}$$

or

$$\underbrace{x^*_{t+1} - x^*_t}_{\text{revision in expectation}} = (1 - \lambda)\underbrace{(x_t - x^*_t)}_{\text{last period’s error}} \tag{10.12}$$
Equation (10.12) says that expectations are revised (upward or downward)
based on the most recent error. Suppose that $x^*_t$ was 100 but $x_t$ was 120.
The error in prediction or expectation is 20. The prediction for $(t + 1)$ will
be revised upward but by less than the last period’s error. The prediction
will, therefore, be > 100 but < 120 (since $0 < \lambda < 1$). This is the
reason why the model given by (10.9) is called the adaptive expectations model
(adaptive based on the most recent error).

Again, since the coefficients in (10.9) sum to 1, if there is a trend in $x_t$,
the formula for $x^*_{t+1}$ has to be adjusted so that $x^*_{t+1}$ is multiplied
by $(1 + g)$, where $g$ is the average growth rate in $x_t$. Otherwise, the
adaptive expectations model can continuously underpredict the true value.
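
A short numerical sketch may help fix ideas. It applies the error-correction form of the adaptive expectations model to the 100/120 example and checks that repeated updating reproduces the geometric distributed lag, truncated at the start of the sample. The function names, the value $\lambda = 0.6$, and the data are assumptions for illustration.

```python
import numpy as np

def adaptive_update(x_star, x, lam):
    """Revise the expectation by (1 - lam) times the latest error."""
    return x_star + (1.0 - lam) * (x - x_star)

def geometric_lag(x_history, lam):
    """Sum of (1 - lam) * lam**i * x_{t-i} over the available history
    (most recent observation last)."""
    x = np.asarray(x_history, dtype=float)[::-1]
    weights = (1.0 - lam) * lam ** np.arange(len(x))
    return float(weights @ x)

lam = 0.6                                   # illustrative value
revised = adaptive_update(x_star=100.0, x=120.0, lam=lam)
print("Revised expectation:", revised)      # lies between 100 and 120

# Starting from a zero initial expectation, the recursion and the truncated
# geometric lag give the same forecast.
x_obs = [3.0, 5.0, 4.0, 6.0, 7.0]
x_star = 0.0
for x in x_obs:
    x_star = adaptive_update(x_star, x, lam)
print("Recursion:", x_star, " Geometric lag:", geometric_lag(x_obs, lam))
```

A smaller $\lambda$ puts more weight on the latest error, so expectations adjust faster.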


10.4 Estimation with the Adaptive 
Expectations Model 

Consider now the estimation of the investment equation (10.1) where the
expected profits $x^*_{t+1}$ are given by the adaptive expectations model (10.9).
We can substitute (10.9) in (10.1) and try to estimate the equation in that
form. This is called estimation in the distributed lag form.
Alternatively, we can try to use equation (10.11) to eliminate the unobserved
$x^*_{t+1}$ and estimate the resulting equation. This is called estimation in
the autoregressive form. Since this is easier, we will discuss this first.


Estimation in the Autoregressive Form 

Consider equation (10.1), which we want to estimate.

$$y_t = a + b x^*_{t+1} + u_t \tag{10.1}$$

Lag this equation by one time period and multiply throughout by $\lambda$. We get

$$\lambda y_{t-1} = a\lambda + b\lambda x^*_t + \lambda u_{t-1} \tag{10.13}$$

Subtracting (10.13) from (10.1) and using the definition of the adaptive
expectations model as given in (10.11), we get

$$y_t - \lambda y_{t-1} = a(1 - \lambda) + b(x^*_{t+1} - \lambda x^*_t) + u_t - \lambda u_{t-1}$$

$$= a(1 - \lambda) + b(1 - \lambda)x_t + u_t - \lambda u_{t-1}$$

or

$$y_t = a' + \lambda y_{t-1} + b' x_t + v_t \tag{10.14}$$

where $a' = a(1 - \lambda)$, $b' = b(1 - \lambda)$, and
$v_t = u_t - \lambda u_{t-1}$. We have eliminated the unobserved $x^*_{t+1}$
and obtained an equation in the observed variable $x_t$.

Since equation (10.14) involves a regression of $y_t$ on $y_{t-1}$, we call this
the autoregressive form. One can think of estimating equation (10.14) by ordinary
least squares and, in fact, this is what was done in the 1950s and that accounted
for the popularity of the adaptive expectations model. However, notice that the
error term $v_t$ is equal to $u_t - \lambda u_{t-1}$ and is thus autocorrelated.
Since $y_{t-1}$ involves $u_{t-1}$, we see that $y_{t-1}$ is correlated with the
error term $v_t$. Thus estimation of equation (10.14) by ordinary least squares
gives us inconsistent estimates of the parameters.

What is the solution? We can use the instrumental variable method. Use
$x_{t-1}$ as an instrument for $y_{t-1}$. Thus the normal equations will be

$$\sum x_t \hat{v}_t = 0 \quad \text{and} \quad \sum x_{t-1} \hat{v}_t = 0$$
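
The inconsistency of OLS and the IV remedy can be seen in a small simulation. Everything in this sketch is an assumption made for illustration: the parameter values, the AR(1) process for $x_t$ (which makes $x_{t-1}$ a relevant instrument), and the inclusion of a constant among the instruments.

```python
import numpy as np

rng = np.random.default_rng(1)
T, burn = 20000, 200
a, b, lam = 1.0, 2.0, 0.6               # illustrative true values
n = T + burn
eps = rng.normal(size=n)
u = rng.normal(size=n)

x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + eps[t]      # assumed AR(1) process for profits
    # Autoregressive form (10.14) with error v_t = u_t - lam * u_{t-1}.
    y[t] = (a * (1 - lam) + lam * y[t - 1] + b * (1 - lam) * x[t]
            + u[t] - lam * u[t - 1])
x, y = x[burn:], y[burn:]               # drop start-up observations

y_t, y_lag, x_t, x_lag = y[1:], y[:-1], x[1:], x[:-1]
ones = np.ones_like(y_t)
W = np.column_stack([ones, y_lag, x_t])     # regressors of (10.14)
Z = np.column_stack([ones, x_lag, x_t])     # instruments: x_{t-1} for y_{t-1}

ols = np.linalg.solve(W.T @ W, W.T @ y_t)
iv = np.linalg.solve(Z.T @ W, Z.T @ y_t)

print(f"true lambda = {lam}")
print(f"OLS estimate of lambda = {ols[1]:.3f}")
print(f"IV  estimate of lambda = {iv[1]:.3f}")
```

Because $y_{t-1}$ and $v_t$ share the term $u_{t-1}$ with opposite signs, the OLS estimate of $\lambda$ is pulled downward in this design, while the IV estimate stays close to the true value.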

The other alternative is to take the error structure of $v_t$ explicitly into account
(note that it depends on $\lambda$). But this amounts to not making the transformation
we have made and thus is exactly the same as estimation in the distributed lag
form, which is as follows.


Estimation in Distributed Lag Form 

Substituting the expression (10.9) in (10.1), we get

$$y_t = a + b\left[\sum_{i=0}^{\infty} (1 - \lambda)\lambda^i x_{t-i}\right] + u_t$$

Since this involves an infinite series and we do not observe the infinite past
values of $x_t$, we have to tackle this problem somehow. What we do is break
up the series into the observed and unobserved past. We will write the infinite
series as

$$\sum_{i=0}^{t-1} (1 - \lambda)\lambda^i x_{t-i} + \sum_{i=t}^{\infty} (1 - \lambda)\lambda^i x_{t-i}$$

The first part is observed and we will denote it by $z_{1t}(\lambda)$. We write
$z_{1t}(\lambda)$ since it depends on $\lambda$. The second part can be written as

$$\lambda^t \sum_{i=t}^{\infty} (1 - \lambda)\lambda^{i-t} x_{t-i} = \lambda^t \sum_{j=0}^{\infty} (1 - \lambda)\lambda^j x_{-j} \qquad (\text{writing } j = i - t)$$

Notice that the second part of this expression is nothing but (10.9) with $t = 0$,
that is, $x^*_1$ (the expected price for the first period). If we treat this as an
unknown parameter $c$ and define $z_{2t} = \lambda^t$, we can write

$$x^*_{t+1} = z_{1t} + c z_{2t}$$

Thus equation (10.1) that we wish to estimate can be written as

$$\begin{aligned}
y_t &= a + b x^*_{t+1} + u_t \\
    &= a + b(z_{1t} + c z_{2t}) + u_t \\
    &= a + b z_{1t} + c' z_{2t} + u_t
\end{aligned} \tag{10.15}$$

with $c' = bc$. Note that $z_{1t}$ and $z_{2t}$ depend on $\lambda$. Actually, we are not
interested in the parameter $c'$. The estimation proceeds as follows.

For each value of $\lambda$ in the range (0, 1) we construct the variables

$$z_{1t} = \sum_{i=0}^{t-1} (1 - \lambda)\lambda^i x_{t-i}$$

and $z_{2t} = \lambda^t$. Thus

$$\begin{aligned}
z_{11} &= (1 - \lambda)x_1 \\
z_{12} &= (1 - \lambda)(x_2 + \lambda x_1) \\
z_{13} &= (1 - \lambda)(x_3 + \lambda x_2 + \lambda^2 x_1)
\end{aligned}$$

and so on. We estimate equation (10.15) by ordinary least squares and obtain
the residual sum of squares. Call this RSS($\lambda$). We choose the value of $\lambda$ for
which RSS($\lambda$) is minimum and obtain the corresponding estimates of $a$ and $b$
as the desired least squares estimates.

Note that since $z_{1t}$ and $z_{2t}$ are nonlinear functions of $\lambda$, the estimation of
(10.15) involves nonlinear least squares. What we have done is to note that for
given $\lambda$ we have a linear least squares model. Thus we are using a search
procedure over $\lambda$. In practice one can choose $\lambda$ in intervals of 0.1 in the first
stage and then intervals of 0.01 in the second stage.
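The two-stage search can be sketched in a few lines of Python. This is a
minimal illustration, not a definitive implementation: the simulated data, the
parameter values, and the function names `build_regressors`, `rss_for_lambda`,
`search_lambda`, and `simulate` are assumptions introduced here. The simulation
uses a presample of unobserved $x$ values so that the initial expectation $c$ is
nonzero, which is the case the regressor $z_{2t}$ is meant to handle.

```python
import numpy as np

def build_regressors(x, lam):
    """Return z1[t] = sum_{i<t} (1-lam) lam^i x_{t-i} and z2[t] = lam^t, t = 1..T."""
    T = len(x)
    z1 = np.empty(T)
    acc = 0.0
    for t in range(T):
        # recursive form of the truncated geometric sum
        acc = lam * acc + (1.0 - lam) * x[t]
        z1[t] = acc
    z2 = lam ** np.arange(1, T + 1)
    return z1, z2

def rss_for_lambda(y, x, lam):
    """OLS of y on (1, z1, z2) for a given lambda; returns (RSS, coefficients)."""
    z1, z2 = build_regressors(x, lam)
    X = np.column_stack([np.ones(len(y)), z1, z2])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return float(resid @ resid), coef

def search_lambda(y, x):
    """Coarse grid in steps of 0.1, then a fine grid in steps of 0.01."""
    coarse = [k / 10 for k in range(1, 10)]
    best = min(coarse, key=lambda lam: rss_for_lambda(y, x, lam)[0])
    fine = [round(best + k / 100, 2) for k in range(-9, 10)]
    fine = [lam for lam in fine if 0.0 < lam < 1.0]
    best = min(fine, key=lambda lam: rss_for_lambda(y, x, lam)[0])
    return best, rss_for_lambda(y, x, best)[1]

def simulate(T=200, burn=200, a=1.0, b=2.0, lam=0.6, sigma=0.2, seed=0):
    """Simulated data (illustrative values); the first `burn` periods are unobserved."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=burn + T)
    x_star = np.empty(burn + T)
    expectation = 0.0
    for t in range(burn + T):
        expectation = lam * expectation + (1.0 - lam) * x[t]
        x_star[t] = expectation
    y = a + b * x_star + sigma * rng.normal(size=burn + T)
    return y[burn:], x[burn:]

y, x = simulate()
lam_hat, coef = search_lambda(y, x)
print("lambda:", lam_hat, " a, b, c':", coef)
```

Each trial value of $\lambda$ costs one linear regression, which is why the search
is cheap even though the model is nonlinear in $\lambda$.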




10.5 Two Illustrative Examples 


As mentioned earlier, the adaptive expectations model was used by (among
others) Nerlove for the analysis of agricultural supply functions, and by Cagan
for the analysis of hyperinflations. The hyperinflation model involves more
problems than have been noted in the literature, and hence we will discuss it
in greater detail. In the analysis of agricultural supply functions, we have to
deal with the expectation of a current endogenous variable $p_t$, whereas in the
hyperinflation model, we have to deal with the expectation of a future
endogenous variable $p_{t+1}$.

Consider the supply function

$$Q_t = \alpha + \beta p^*_t + \gamma z_t + u_t$$

where $p^*_t$ is the price expected to prevail at time $t$ (as expected in the preceding
period), $Q_t$ the quantity supplied, $z_t$ an exogenous variable, and $u_t$ a disturbance
term. The adaptive expectations model implies that

$$p^*_t - \lambda p^*_{t-1} = (1 - \lambda)p_{t-1}$$

Thus

$$Q_t - \lambda Q_{t-1} = \alpha - \alpha\lambda + \beta(p^*_t - \lambda p^*_{t-1}) + \gamma(z_t - \lambda z_{t-1}) + u_t - \lambda u_{t-1}$$

Eliminating $p^*_t$, we get

$$Q_t = \alpha(1 - \lambda) + \lambda Q_{t-1} + \beta(1 - \lambda)p_{t-1} + \gamma z_t - \lambda\gamma z_{t-1} + u_t - \lambda u_{t-1}$$

This equation can be estimated by the methods described in Section 10.4.
Computation of the estimates for this model using the data in Table 9.1 is left as an
exercise. We will consider Cagan's hyperinflation model in greater detail
because it involves more problems than the agricultural supply model.

The hyperinflation model says that the demand for real cash balances is
inversely related to the expected rate of inflation. That is, the higher the expected
rate of inflation, the lower the real cash balances that individuals would want
to hold. Of course, the demand for real cash balances depends on other
variables like income, but during a hyperinflation the expected inflation rate is the
more dominant variable. We specify the demand for money function as

$$m_t - p_t = a + b(p^*_{t+1} - p_t) + u_t \qquad b < 0$$

where $m_t$ is the log of the money supply, $p_t$ the log of the price level, $p^*_{t+1}$ the
expectation of $p_{t+1}$ as expected at time $t$, and $u_t$ the error term. Since the
variables are in logs, $p^*_{t+1} - p_t$ is the expected inflation rate. We will denote it by
$\pi^*_{t+1}$. The actual rate of inflation is $\pi_{t+1} = p_{t+1} - p_t$. Let us define
$y_t = m_t - p_t$. Then the demand for money model can be written as

$$y_t = a + b\pi^*_{t+1} + u_t$$

This is similar to equation (10.1) and the estimable equation we derived in
Section 10.4 is equation (10.14). This equation applies here with $\pi_t$ in place of $x_t$.
The equation we get is

$$y_t = a(1 - \lambda) + \lambda y_{t-1} + b(1 - \lambda)\pi_t + v_t \tag{10.14'}$$

where $v_t = u_t - \lambda u_{t-1}$. Cagan estimated this equation by OLS, but the problem
with the OLS estimation of this equation is that it gives inconsistent estimators
because the equation involves the lagged dependent variable $y_{t-1}$ and the errors
are serially correlated. One can use Klein's method or estimation in the
distributed lag form to avoid this problem. The equation we estimate is similar to
equation (10.15) and is

$$y_t = a + b z_{1t} + c z_{2t} + u_t \tag{10.15'}$$

where

$$z_{1t} = \sum_{i=0}^{t-1} (1 - \lambda)\lambda^i \pi_{t-i}$$

and $z_{2t} = \lambda^t$. Thus for each value of $\lambda$ we generate $z_{1t}$ and $z_{2t}$ and estimate
(10.15') by OLS. We choose the value of $\lambda$ for which the residual sum of
squares is minimum, and the corresponding estimates of $a$, $b$, and $c$.

However, even though this is the correct method of estimation for the
agricultural supply functions, it is not the correct method for the hyperinflation
model. The reason is that the model is one where money supply $m_t$ is the
exogenous variable and price level $p_t$ is the endogenous variable. By defining the
variable $y_t = m_t - p_t$ and writing the equation in a form similar to (10.1) we
have missed this point. Equation (10.15') cannot be estimated by OLS because
$\pi_t = p_t - p_{t-1}$ in $z_{1t}$ is correlated with the error term $u_t$.

We can solve this problem by moving the variable $p_t$ to the left-hand side.
Let us define $W_t = z_{1t} - (1 - \lambda)p_t$. Now $W_t$ involves $p_{t-1}$ and higher-order
lagged values of $p_t$ and does not involve current $p_t$. We can write equation
(10.15') as

$$m_t - p_t = a + b(1 - \lambda)p_t + bW_t + c z_{2t} + u_t$$

Collecting the coefficients of $p_t$ and simplifying, we can write this equation as

$$p_t = \theta_0 + \theta_1 m_t + \theta_2 W_t + \theta_3 z_{2t} + v_t \tag{10.15''}$$

where

$$\theta_0 = \frac{-a}{1 + b(1 - \lambda)} \qquad \theta_1 = \frac{1}{1 + b(1 - \lambda)} \qquad \theta_2 = \frac{-b}{1 + b(1 - \lambda)}$$

$$\theta_3 = \frac{-c}{1 + b(1 - \lambda)} \qquad v_t = \frac{-u_t}{1 + b(1 - \lambda)}$$

Thus the appropriate equation to estimate for the estimation of the
hyperinflation model under adaptive expectations is equation (10.15'') and not equation
(10.15') or the simple OLS estimation of the autoregressive form (10.14'). Both
equations (10.14') and (10.15') ignore the fact that the same endogenous
variable $p_t$ occurs on both the left-hand side and the right-hand side of the equation.
Equation (10.14') can also be rearranged similarly. We rewrite it as

$$m_t - p_t = a(1 - \lambda) + \lambda(m_{t-1} - p_{t-1}) + b(1 - \lambda)(p_t - p_{t-1}) + v_t$$

We have to gather the coefficients of $p_t$, normalize the equation with respect to
$p_t$, and estimate it using $p_t$ as the dependent variable and $m_t$, $m_{t-1}$, and $p_{t-1}$ as
the explanatory variables. The problem of lagged dependent variables and
serially correlated errors still remains and we have to estimate the equation by
instrumental variables using, say, $m_{t-2}$ as an instrument.

In summary, the estimation of the hyperinflation model under adaptive
expectations is not as straightforward as it appears at first sight looking at
equation (10.14') or (10.15'). An important aspect of the model under adaptive
expectations (and also some naive expectations) that has not been noticed often
is the occurrence of the endogenous variable $p_t$ on the right-hand side of the
equation, which makes OLS estimation inapplicable unless some rearranging
of the variables is made. The appropriate dependent variable should be $p_t$, and
the explanatory variables are $m_t$ and lagged values of $m_t$ and $p_t$.

Since the estimation of the hyperinflation model under adaptive expectations
is only of historic interest, we will not present the results here. The estimation
of equations (10.14'), (10.14''), (10.15'), and (10.15'') is left as an exercise for
students. Tables 10.1 and 10.2 on the following pages present the data for
Hungary and Germany, respectively. The last periods (particularly since June 1924)
can be omitted in the analysis. The data have been provided here because the
same data can be used for the estimation of the rational expectations model
discussed in Section 10.10.


10.6 Expectational Variables and 
Adjustment Lags 

In previous sections we discussed models in which the expectational variables 
were functions of lagged values of the relevant variables for which expectations 
were formed. There is another source of lagged relationships. This is lags in 
adjustment to desired levels. In practice both these lags will be present and 
there is an econometric problem of isolating (or identifying) the separate effects 
of these two sources. 

A simple model incorporating adjustment lags is the partial adjustment
model, which says that firms adjust their variables (say, capital stock) only
partially toward their desired levels. Suppose that a firm anticipates some
change in the demand for its product. In anticipation of that, it has to adjust its
productive capacity or capital stock, denoted by $y_t$. But it cannot do this
immediately. Let $y_t^d$ be the "desired" capital stock. Then $y_t^d - y_{t-1}$ is the desired

Table 10.1  Price Index and Money Supply for Hungary, 1921-1925ᵃ

              Notes in    Current Accounts       Price
           Circulation        and Deposits       Index
1921  Jan.       15.21                3.85           -
      Feb.       15.57                5.53           -
      Mar.       15.65                5.25           -
      Apr.       13.12                6.80           -
      May        13.69                5.76           -
      June       18.10                1.16           -
      July       15.80                3.53        4.20
      Aug.       17.33                2.98        5.40
      Sept.      20.84                2.41        6.25
      Oct.       23.64                2.15        6.75
      Nov.       24.74                2.35        8.30
      Dec.       25.18                2.24        8.25
1922  Jan.       25.68                2.49        8.10
      Feb.       26.76                2.35        8.50
      Mar.       29.33                2.22        9.90
      Apr.       30.58                2.90       10.75
      May        31.93                3.29       11.00
      June       33.60                3.74       12.90
      July       38.36                3.93       17.40
      Aug.       46.24                5.42       21.40
      Sept.      58.46                5.93       26.60
      Oct.       70.00                5.19       32.90
      Nov.       72.02                6.41       32.60
      Dec.       75.89                4.76       33.40
1923  Jan.       73.72                5.89       38.50
      Feb.       75.14                6.60       41.80
      Mar.       82.21               11.15       66.00
      Apr.      100.10                9.79       83.50
      May       119.29               10.61       94.00
      June      155.00               12.74      144.50
      July      226.29               21.98      286.00
      Aug.      399.49               23.63      462.50
      Sept.     588.81               60.25      554.00
      Oct.      744.93               60.18      587.00
      Nov.      853.99               74.97      635.00
      Dec.      931.34               84.79      714.00
1924  Jan.     1084.70              105.48     1026.00
      Feb.     1278.40              164.84     1839.10
      Mar.     1606.90              253.90     2076.70
      Apr.     2098.10              308.10     2134.60
      May      2486.30              527.10     2269.60
      June     2893.70             1135.70     2207.80
      July     3277.90             1424.60     2294.50
      Aug.     3659.80             1473.20     2242.00
      Sept.    4115.90             1416.40     2236.60
      Oct.     4635.10             1465.30     2285.20
      Nov.     4442.60             1929.80     2309.50
      Dec.     4514.00             2069.50     2346.60
1925  Jan.     4449.60             2138.60     2307.50
      Feb.     4238.00             2542.30     2218.70
      Mar.     4270.10             2552.80     2177.80
      Apr.     4526.20             2470.50           -

ᵃ Money supply in 10^9 kronen.

Source: John Parke Young, European Currency and Finance, Commission of Gold and Silver
Inquiry, U.S. Senate, Serial 9, Vol. 2, U.S. Government Printing Office, Washington, D.C.,
1925.

Table 10.2  Prices and Money Supply in Germany, 1921-1924ᵃ

                  Notes in    Total Demand       Wholesale
               Circulation        Deposits     Price Index
1921  Jan.           66.62           15.83            1.44
      Feb.           67.43           17.36            1.38
      Mar.           69.42           28.04            1.34
      Apr.           70.84           20.86            1.33
      May            71.84           14.09            1.31
      June           75.32           20.39            1.37
      July           77.39           15.82            1.43
      Aug.           80.07           13.65            1.92
      Sept.          86.38           19.98            2.07
      Oct.           91.53           18.30            2.46
      Nov.          100.90           25.31            3.42
      Dec.          113.60           32.91            3.49
1922  Jan.          115.40           23.42            3.67
      Feb.          120.00           26.53            4.10
      Mar.          130.70           33.36            5.43
      Apr.          140.40           31.62            6.36
      May           151.90           33.13            6.46
      June          169.20           37.17            7.03
      July          189.80           39.98           10.16
      Aug.          238.10           56.12           19.20
      Sept.         316.90          110.00           28.70
      Oct.          469.50          140.80           56.60
      Nov.          754.10          241.00          115.10
      Dec.        1,280.00          530.50          147.50
1923  Jan.        1,984.00          763.30          278.50
      Feb.        3,513.00        1,583.00          588.50
      Mar.        5,518.00        2,272.00          488.80
      Apr.        6,546.00        3,854.00          521.20
      May         8,564.00        5,063.00          817.00
      June       17,290.00        9,953.00        1,938.50
      July       43,600.00       27,857.00        7,478.70
      Aug.      663,200.00      591,080.00       94,404.00
      Sept.   2,823 × 10^4    1,697 × 10^4    2,395 × 10^3
      Oct.    2,497 × 10^6    3,868 × 10^6    709.5 × 10^6
      Nov.    4,003 × 10^8    3,740 × 10^8     72.6 × 10^9
      Dec.    4,965 × 10^8    5,480 × 10^8      126 × 10^9
1924  Jan.    4,837 × 10^8    2,813 × 10^8      117 × 10^9
      Feb.    5,879 × 10^8    6,505 × 10^8      116 × 10^9
      Mar.    6,899 × 10^8    7,047 × 10^8      121 × 10^9
      Apr.    7,769 × 10^8    8,050 × 10^8      124 × 10^9
      May     9,269 × 10^8    8,046 × 10^8      122 × 10^9
      June   10,970 × 10^8    7,739 × 10^8      116 × 10^9
      July   12,110 × 10^8    7,430 × 10^8      115 × 10^9
      Aug.   13,920 × 10^8    5,619 × 10^8      120 × 10^9
      Sept.  15,210 × 10^8    6,701 × 10^8      127 × 10^9
      Oct.   17,810 × 10^8    7,087 × 10^8      131 × 10^9
      Nov.   18,630 × 10^8    7,039 × 10^8      129 × 10^9
      Dec.   19,410 × 10^8    8,209 × 10^8      131 × 10^9

ᵃ Money supply is for end of the month and in 10^9 marks. Since Jan. 1924 the Reichsbank
reported money supply in reichsmarks. 1 reichsmark = 10^12 old marks. All figures have been
converted to old marks and rounded to four significant digits. Thus the money supply is all on
a comparable basis for the whole period.

Source: John Parke Young, European Currency and Finance, Commission of Gold and Silver
Inquiry, U.S. Senate, Serial 9, Vol. 1, U.S. Government Printing Office, Washington, D.C.,
1925.

change. The partial adjustment model says that the actual change is only a
fraction of the desired change, that is,

$$y_t - y_{t-1} = \delta(y_t^d - y_{t-1}) \qquad \text{where } 0 < \delta < 1 \tag{10.16}$$

Note that this equation is similar to the adaptive expectations model (10.12)
with $y_t = x^*_{t+1}$ and $y_t^d = x_t$. Thus $y_t$ will be a distributed lag of $y_t^d$ with
geometrically declining weights. Now suppose that $y_t^d$ is a function of anticipated
sales $x_t$; then we can write

$$y_t^d = \alpha x_t + e_t \qquad \text{where } e_t \sim \text{IID}(0, \sigma_e^2)$$

and combining the two equations we get

$$y_t - y_{t-1} = \delta(\alpha x_t + e_t - y_{t-1})$$

or

$$y_t = (1 - \delta)y_{t-1} + \alpha\delta x_t + \delta e_t$$

which can be written as

$$y_t = \beta_1 y_{t-1} + \beta_2 x_t + u_t$$

with $\beta_1 = 1 - \delta$, $\beta_2 = \alpha\delta$, and $u_t = \delta e_t$. Note that $0 < \beta_1 < 1$ and we need this
extra condition for a partial adjustment model. Also note that the properties of
the error term $u_t$ are the same as those of $e_t$. Thus the partial adjustment model
does not change the properties of the error term.
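Because the error term keeps its properties, ordinary least squares on the final
equation recovers $\delta$ and $\alpha$ directly. The Python sketch below illustrates
this on simulated data; the parameter values, the i.i.d. normal draws for $x_t$
and $e_t$, and the function names `simulate_partial_adjustment` and
`estimate_partial_adjustment` are assumptions made for the illustration only.

```python
import numpy as np

def simulate_partial_adjustment(x, alpha, delta, sigma_e, rng, y0=0.0):
    """Generate y_t = y_{t-1} + delta * (y_t^d - y_{t-1}) with y_t^d = alpha*x_t + e_t."""
    y = np.empty(len(x))
    prev = y0
    for t, x_t in enumerate(x):
        y_desired = alpha * x_t + rng.normal(scale=sigma_e)
        prev = prev + delta * (y_desired - prev)
        y[t] = prev
    return y

def estimate_partial_adjustment(y, x):
    """OLS of y_t on y_{t-1} and x_t (no constant), then solve for delta and alpha."""
    X = np.column_stack([y[:-1], x[1:]])
    beta1, beta2 = np.linalg.lstsq(X, y[1:], rcond=None)[0]
    delta = 1.0 - beta1       # beta1 = 1 - delta
    alpha = beta2 / delta     # beta2 = alpha * delta
    return delta, alpha

rng = np.random.default_rng(0)
x = rng.normal(size=20000)
y = simulate_partial_adjustment(x, alpha=1.5, delta=0.3, sigma_e=1.0, rng=rng)
delta_hat, alpha_hat = estimate_partial_adjustment(y, x)
print("delta:", delta_hat, " alpha:", alpha_hat)
```

The check that the estimated $\beta_1$ lies between 0 and 1 is the extra condition
mentioned above; an estimate outside that range would be a sign that the partial
adjustment interpretation does not fit the data.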

A simple explanation as to why firms make only a partial adjustment to the
desired level is as follows. The firm faces two costs: costs of making the
adjustment and costs of being in disequilibrium. If the two costs are quadratic
and additive, we can write total cost $C_t$ as

$$C_t = a_1(y_t - y_{t-1})^2 + a_2(y_t^d - y_t)^2$$

Given $y_{t-1}$ and $y_t^d$ we have to choose $y_t$ so that total cost $C_t$ is minimum.

$$\frac{dC_t}{dy_t} = 0 \quad\text{gives}\quad 2a_1(y_t - y_{t-1}) = 2a_2(y_t^d - y_t) = 2a_2[y_t^d - y_{t-1} - (y_t - y_{t-1})]$$

Thus we get

$$(y_t - y_{t-1}) = \delta(y_t^d - y_{t-1})$$

where

$$\delta = \frac{a_2}{a_1 + a_2}$$

Note that $0 < \delta < 1$ and $\delta$ is close to 1 if the costs of being in disequilibrium
are much higher than the costs of adjustment. $\delta$ is close to zero if the costs of
adjustment are much higher than the costs of disequilibrium.

Partial adjustment models were popular in the 1950s and 1960s but were
criticized as being ad hoc.⁷ The desired level $y_t^d$ is derived independently by some
optimization rule and then the adjustment equation is tagged on to it. However,
the costs of adjustment and the costs of being in disequilibrium should be
incorporated in the optimization rule. There have been many attempts along
these lines but they have not resulted in any tractable estimable equations.⁸

One refinement that can easily be done is to make the partial adjustment
parameter $\delta$ a function of some explanatory variables that are considered
important in determining the speed of adjustment (e.g., interest rates). Denoting
the interest rate by $r_t$, since $\delta_t$ is supposed to be between 0 and 1 we can write

$$\delta_t = \frac{\exp(\alpha_0 + \alpha_1 r_t)}{1 + \exp(\alpha_0 + \alpha_1 r_t)}$$

(we use a subscript $t$ for $\delta$ since it changes over time).
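A short numerical illustration of this logistic form follows. The function name
`adjustment_speed` and the values chosen for $\alpha_0$, $\alpha_1$, and the
interest rates are hypothetical; the only point is that the formula maps any
interest rate into an adjustment speed strictly between 0 and 1.

```python
import math

def adjustment_speed(r, a0, a1):
    """Logistic adjustment parameter delta_t = exp(z) / (1 + exp(z)), z = a0 + a1*r."""
    z = a0 + a1 * r
    if z >= 0:
        # algebraically identical form that avoids overflow for large positive z
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# hypothetical coefficients: a higher interest rate slows adjustment (a1 < 0)
rates = [0.0, 0.05, 0.10, 0.20]
speeds = [adjustment_speed(r, a0=0.5, a1=-10.0) for r in rates]
for r, d in zip(rates, speeds):
    print(f"r = {r:.2f}  delta_t = {d:.3f}")
```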

However, there is no reason why the adjustment parameter should lie between
0 and 1 at all times. If we drop that restriction, we can make $\delta_t$ a linear
function of $r_t$. There are many empirical studies that use partial adjustment
models with varying coefficients. Since we have explained the basic idea we will
not review them here.⁹

A more generalized version of the partial adjustment model is the error
correction model. This model says that

$$y_t - y_{t-1} = \underbrace{\delta(y_t^d - y_{t-1}^d)}_{\text{change in the desired values}} + \underbrace{\gamma(y_{t-1}^d - y_{t-1})}_{\text{past period's disequilibrium}} \tag{10.17}$$

where $0 < \delta < 1$ and $0 < \gamma < 1$. If $\delta = \gamma$, we have the partial adjustment model.
Unlike the partial adjustment model, however, this model generates serially
correlated errors in the final equation we estimate. Suppose that, as before, we
write

$$y_t^d = \alpha x_t + e_t$$

where $x_t$ is anticipated sales. Then substituting this in (10.17), we get

$$(y_t - y_{t-1}) = \alpha\delta(x_t - x_{t-1}) + \alpha\gamma x_{t-1} - \gamma y_{t-1} + \delta e_t - (\delta - \gamma)e_{t-1} \tag{10.18}$$

The error term is now correlated with $y_{t-1}$ and we cannot estimate this equation
by ordinary least squares. Again, we can think of using an instrumental variable
(say, $x_{t-2}$ for $y_{t-1}$) and estimate this equation by instrumental variable methods.

⁷ Z. Griliches, "Distributed Lags: A Survey," Econometrica, January 1967, pp. 16-49, Sec. 5,
"Theoretical Ad-Hockery."

⁸ See M. Nerlove, "Lags in Economic Behavior," Econometrica, March 1972, pp. 221-251, for
a survey of the work on adjustment costs and the development of a model with adjustment
costs. There are, however, no empirical results in the paper. Nerlove says, "current research
on lags in economic behavior is not 'good' because neither is the empirical research soundly
based in economic theory nor is the theoretical research very strongly empirically oriented"
(p. 246).

⁹ For a review of the model up to the early 1970s, see J. C. R. Rowley and P. K. Trivedi,
Econometrics of Investment (New York: Wiley, 1975), pp. 86-89 on "variable lags."

During recent years error correction models have been more popular in
empirical work than partial adjustment models.¹⁰ The empirical evidence and
economic interpretation are in favor of the former model. Also, one can test
the partial adjustment model by testing $\delta = \gamma$ in (10.18).


10.7 Partial Adjustment with 
Adaptive Expectations 


When adjustment and expectational lags are combined we may have some
problems of identifying their separate effects. We can see the problem by
considering the partial adjustment model with adaptive expectations. We can
consider the error correction model as well but the partial adjustment model is
simpler.

Suppose that

    $K_t^d$ = desired capital stock at the beginning of period $t$

    $S_t^*$ = expected sales in period $t$

$$K_t^d = \beta_0 + \beta_1 S_t^* + u_t \tag{10.19}$$

The partial adjustment model states that

$$K_t - K_{t-1} = \delta(K_t^d - K_{t-1}) \qquad 0 < \delta < 1$$

and substituting for $K_t^d$, we get

$$K_t = \beta_0\delta + (1 - \delta)K_{t-1} + \beta_1\delta S_t^* + \delta u_t \tag{10.20}$$

If we use the adaptive expectations model

$$S_t^* - S_{t-1}^* = \lambda(S_{t-1} - S_{t-1}^*) \qquad 0 < \lambda < 1$$

then, lagging equation (10.20) by one period, multiplying it by $(1 - \lambda)$,
subtracting this from (10.20), and simplifying, we get

$$K_t = \delta\lambda\beta_0 + (1 - \delta + 1 - \lambda)K_{t-1} - (1 - \delta)(1 - \lambda)K_{t-2} + \lambda\beta_1\delta S_{t-1} + v_t \tag{10.21}$$

where

$$v_t = \delta[u_t - (1 - \lambda)u_{t-1}]$$

Now if we added the error term to the final simplified equation (10.21) rather
than (10.20), it is easy to see from (10.21) that $\delta$ and $\lambda$ occur symmetrically and
hence there is some ambiguity in their estimates, which have to be obtained
from the coefficients of $K_{t-1}$ and $K_{t-2}$.
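The symmetry can be verified numerically. In the Python sketch below, the
function names `reduced_form` and `recover_pair` and the parameter values are
hypothetical; the code only evaluates the coefficients of (10.21) and shows that
interchanging $\delta$ and $\lambda$ leaves every coefficient unchanged, so the two
coefficients on lagged $K$ determine the pair only up to its order.

```python
import math

def reduced_form(delta, lam, beta0, beta1):
    """Coefficients of (10.21): constant, K_{t-1}, K_{t-2}, S_{t-1}."""
    return (delta * lam * beta0,
            (1 - delta) + (1 - lam),
            -(1 - delta) * (1 - lam),
            lam * beta1 * delta)

def recover_pair(coef_k1, coef_k2):
    """Solve for the unordered pair {delta, lam} from the two K coefficients."""
    # (1 - delta) and (1 - lam) are the roots of z^2 - coef_k1*z - coef_k2 = 0
    disc = max(coef_k1 ** 2 + 4 * coef_k2, 0.0)
    root_hi = (coef_k1 + math.sqrt(disc)) / 2
    root_lo = (coef_k1 - math.sqrt(disc)) / 2
    return tuple(sorted((1 - root_hi, 1 - root_lo)))

original = reduced_form(delta=0.3, lam=0.8, beta0=2.0, beta1=1.5)
swapped = reduced_form(delta=0.8, lam=0.3, beta0=2.0, beta1=1.5)
print(original)
print(swapped)
print(recover_pair(original[1], original[2]))
```

The last line returns the two values without saying which one is the speed of
adjustment and which one is the expectations coefficient.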

¹⁰ An example of this is D. F. Hendry and T. von Ungern-Sternberg, "Liquidity and Inflation
Effects on Consumers' Expenditure," in A. S. Deaton (ed.), Essays in the Theory of
Measurement of Consumers' Behavior (Cambridge: Cambridge University Press, 1981).

Note that this ambiguity arises only if an error term is superimposed on the
final simplified equation (10.21), and the equation is estimated by ordinary least
squares assuming that the error terms are serially uncorrelated. On the other
hand, we can estimate equation (10.20) in its distributed lag version.¹¹ For this
we use the procedures for the estimation of adaptive expectations models in
the distributed lag form as described in Section 10.4. Thus if the model is
estimated in the distributed lag form, there is no ambiguity in the estimates of $\delta$
and $\lambda$.

The preceding discussion on partial adjustment models with adaptive
expectations illustrates the point that specification of the error term cannot be done
in a cavalier fashion. The estimation procedures, and whether any parameters
(like $\delta$ and $\lambda$ in our example) are unambiguously estimable or not, depend on
the specification of the error term at different stages of the modeling process.

Of course, one can argue that there is no reason why the errors $u_t$ in (10.19)
should be assumed serially independent. If for some reason one starts with
(10.21) and a general specification for the error term $v_t$, the ambiguity in the
estimation of $\delta$, the speed of adjustment, and $\lambda$, the reaction of expectations,
remains. However, in this case, if the equation (10.19) has some other
explanatory variables, the parameters $\lambda$ and $\delta$ are identified. For example, suppose
that (ignoring the error term, which we will introduce at the end after all
simplifications)

$$K_t^d = \beta_0 + \beta_1 S_t^* + \beta_2 L_t$$

where $L_t$ is the amount of labor hired. Then, on simplification, and adding an
error term $v_t$ at the end, we get

$$\begin{aligned}
K_t = {} & \beta_0\delta\lambda + (1 - \delta + 1 - \lambda)K_{t-1} - (1 - \delta)(1 - \lambda)K_{t-2} + \beta_1\delta\lambda S_{t-1} \\
         & + \delta\beta_2 L_t - \delta\beta_2(1 - \lambda)L_{t-1} + v_t
\end{aligned} \tag{10.22}$$

In this equation $\delta$ and $\lambda$ do not occur symmetrically.

Suppose that we write the equation as

$$K_t = \alpha_1 + \alpha_2 K_{t-1} + \alpha_3 K_{t-2} + \alpha_4 S_{t-1} + \alpha_5 L_t + \alpha_6 L_{t-1} + v_t \tag{10.23}$$

Then, comparing the coefficients of $L_t$ and $L_{t-1}$ with (10.22),

$$\lambda = 1 + \frac{\alpha_6}{\alpha_5}$$

But the problem is that we get two estimates of $\delta$. From the coefficient of $K_{t-1}$
we get

$$\delta = 2 - \alpha_2 - \lambda$$

and from the coefficient of $K_{t-2}$ we get

$$\delta = 1 + \frac{\alpha_3}{1 - \lambda}$$

¹¹ This is equivalent to estimating (10.21) with a moving average error that depends on $\lambda$, as
specified in the equation for $v_t$.

The problem is that equation (10.23) has six parameters and our model has only
five parameters ($\beta_0$, $\beta_1$, $\beta_2$, $\lambda$, and $\delta$). Note, however, that given $\lambda$, equation
(10.22) can be written as

$$\tilde{K}_t = \beta_0\delta\lambda + (1 - \delta)\tilde{K}_{t-1} + \beta_1\delta\lambda S_{t-1} + \beta_2\delta\tilde{L}_t + v_t \tag{10.24}$$

where

$$\tilde{K}_t = K_t - (1 - \lambda)K_{t-1}$$

$$\tilde{L}_t = L_t - (1 - \lambda)L_{t-1}$$

The estimation of (10.24) gives us unique estimates of $\delta$, $\beta_0$, $\beta_1$, and $\beta_2$. Thus
we can use the following two-step procedure:

1. Estimate (10.23) and get an estimate of $\lambda$.

2. Use this to construct $\tilde{K}_t$ and $\tilde{L}_t$ and then estimate equation (10.24) to get
unique estimates of $\delta$, $\beta_0$, $\beta_1$, and $\beta_2$.

An alternative search method is the following. Choose different values of $\lambda$
in the interval (0, 1). For each value of $\lambda$ run the regression of $\tilde{K}_t$ on
$\tilde{K}_{t-1}$, $S_{t-1}$, and $\tilde{L}_t$. Then the value of $\lambda$ for which the residual sum of
squares is minimum is the best estimate of $\lambda$ and the corresponding estimates of
$\delta$, $\beta_0$, $\beta_1$, and $\beta_2$ are the desired estimates of the parameters. Actually, we can
conduct the search in two steps, first at steps of 0.1 and then at steps of 0.01
around the minimum given in the first step. We have discussed a similar procedure
in Section 10.4. These are all examples where an equation that is nonlinear in the
parameters can be reduced to an equation linear in the parameters conditional
on one of the parameters being given.

10.8 Alternative Distributed Lag Models: 
Polynomial Lags 

We saw in previous sections that the adaptive expectations model (10.11)
implies that the expectation $x^*_{t+1}$ is a weighted average of $x_t$ and past values of $x_t$,
with geometrically declining weights. We also saw that the partial adjustment
model (10.16) implies that $y_t$ is a weighted average of $y_t^d$ and past values of $y_t^d$,
again with geometrically declining weights. Since the weights of the lagged
variables all sum to 1, and they are usually all positive, it is customary to
compare these weights to the successive terms in a probability distribution. The
weights $\beta_i$ in equations like (10.7) and (10.8) are said to form a lag distribution.
The geometrically declining weights correspond to the geometric distribution.
As we mentioned in Section 10.3, this type of lag is also called the Koyck lag,
named after L. M. Koyck, who first used it.






In addition to the geometric lag distribution, there are some alternative forms 
of lag distributions that have been suggested in the literature. We will discuss 
these with reference to: 

1. The finite lag distribution (10.7). 

2. The infinite lag distribution (10.8). 


Finite Lags: The Polynomial Lag 

Consider the estimation of the equation

$$y_t = \beta_0 x_t + \beta_1 x_{t-1} + \cdots + \beta_k x_{t-k} + u_t \tag{10.25}$$

The problem with the estimation of this equation is that because of the high
correlations between $x_t$ and its lagged values (multicollinearity), we do not get
reliable estimates of the parameters $\beta_i$. As we discussed in Section 10.3, Irving
Fisher assumed the $\beta_i$ to be declining arithmetically. Almon¹² generalized this
to the case where the $\beta_i$ follow a polynomial of degree $r$ in $i$. This is known as
the Almon lag or the polynomial lag. We denote this by PDL($k$, $r$), where PDL
denotes polynomial distributed lag, $k$ is the length of the lag, and $r$ the degree
of the polynomial. For instance, if $r = 2$, we write

$$\beta_i = \alpha_0 + \alpha_1 i + \alpha_2 i^2 \tag{10.26}$$

Substituting this in (10.25), we get

$$\begin{aligned}
y_t &= \sum_{i=0}^{k} (\alpha_0 + \alpha_1 i + \alpha_2 i^2)x_{t-i} + u_t \\
    &= \alpha_0 z_{0t} + \alpha_1 z_{1t} + \alpha_2 z_{2t} + u_t
\end{aligned} \tag{10.27}$$

where

$$z_{0t} = \sum_{i=0}^{k} x_{t-i} \qquad z_{1t} = \sum_{i=0}^{k} i\,x_{t-i} \qquad z_{2t} = \sum_{i=0}^{k} i^2 x_{t-i}$$

Thus we regress $y_t$ on the constructed variables $z_{0t}$, $z_{1t}$, and $z_{2t}$. After we get the
estimates of the $\alpha$'s we use equation (10.26) to get estimates of the $\beta_i$.
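The construction of the $z$ variables and the recovery of the $\beta_i$ can be sketched
in Python as follows. This is an illustration under assumptions made here: the
function name `pdl_fit`, the simulated data, and the particular lag weights are
hypothetical, and no endpoint constraints are imposed.

```python
import numpy as np

def pdl_fit(y, x, k, r):
    """Fit PDL(k, r): regress y_t on z_{jt} = sum_i i^j x_{t-i}, j = 0..r."""
    T = len(x)
    # column i holds x_{t-i} for t = k, ..., T-1
    lags = np.column_stack([x[k - i: T - i] for i in range(k + 1)])
    # powers[i, j] = i**j, so (lags @ powers)[:, j] is z_{jt}
    powers = np.vander(np.arange(k + 1), r + 1, increasing=True)
    Z = lags @ powers
    alpha = np.linalg.lstsq(Z, y[k:], rcond=None)[0]
    beta = powers @ alpha          # beta_i = sum_j alpha_j * i^j, as in (10.26)
    return alpha, beta

# simulated example: quadratic lag weights with k = 4 (illustrative values)
rng = np.random.default_rng(0)
T, k = 500, 4
alpha_true = np.array([0.5, 0.4, -0.1])
beta_true = np.vander(np.arange(k + 1), 3, increasing=True) @ alpha_true
x = rng.normal(size=T)
# y_t = sum_i beta_i x_{t-i} + noise; the first k values are discarded by pdl_fit
y = np.convolve(x, beta_true)[:T] + 0.1 * rng.normal(size=T)

alpha_hat, beta_hat = pdl_fit(y, x, k, r=2)
print("alpha:", alpha_hat)
print("beta :", beta_hat)
```

Only $r + 1$ coefficients are estimated regardless of the lag length $k$, which is
how the polynomial restriction sidesteps the multicollinearity among the lagged
values of $x_t$.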

Following the suggestion of Almon, often some "endpoint constraints" are
used. For instance, if we use the constraints $\beta_{-1} = 0$ and $\beta_{k+1} = 0$ in (10.26),
we have the following two linear relationships between the $\alpha$'s [substituting
$i = -1$ and $i = k + 1$ in (10.26)].

$$\alpha_0 - \alpha_1 + \alpha_2 = 0 \quad\text{and}\quad \alpha_0 + \alpha_1(k + 1) + \alpha_2(k + 1)^2 = 0 \tag{10.28}$$

These give the conditions

$$\alpha_0 = -\alpha_2(k + 1) \quad\text{and}\quad \alpha_1 = -\alpha_2 k \tag{10.29}$$

Thus we can simplify (10.27) to

$$y_t = \alpha_2 z_t + u_t$$

where

$$z_t = \sum_{i=0}^{k} (i^2 - ki - k - 1)x_{t-i}$$

We get an estimate of $\alpha_2$ by regressing $y_t$ on $z_t$ and we get estimates of $\alpha_0$ and
$\alpha_1$ from (10.29). Using these we get estimates of $\beta_i$ from (10.26).

¹² S. Almon, "The Distributed Lag Between Capital Appropriations and Net Expenditures,"
Econometrica, January 1965, pp. 178-196.

Figure 10.2. Polynomial distributed lag.

Figure 10.2 shows a polynomial distributed lag. The curve shown is that
where $\beta_i$ is a quadratic function of $i$. A quadratic function can take many shapes
and it has been argued that the imposition of the endpoint restrictions (10.29)
is often responsible for the "plausible" shapes of the lag distribution fitted by
the Almon method.¹³ Instead of imposing endpoint constraints a priori, one can
actually test them because once equation (10.27) has been estimated, tests of
the hypotheses like (10.28) are standard tests of linear hypotheses (discussed
in Section 4.8).

Many of the problems related to polynomial lags can be analyzed in terms of
the number of restrictions that it imposes on $\beta_i$ in (10.25). For instance, with a
quadratic polynomial we estimate three $\alpha$'s, whereas we have $(k + 1)$ $\beta$'s in
(10.27). Thus there are $k + 1 - 3 = k - 2$ restrictions on the $\beta$'s. With an
$r$th-degree polynomial, we have $(k - r)$ restrictions.

Suppose that we fit a quadratic polynomial for lag length $k$ and lag length
$(k + 1)$. The residual sum of squares may increase or decrease.¹⁴ The reason is
that linear restrictions are being imposed on two different parameter sets:
$(\beta_1, \beta_2, \ldots, \beta_k)$ and $(\beta_1, \beta_2, \ldots, \beta_k, \beta_{k+1})$.

¹³ P. Schmidt and R. N. Waud, "The Almon Lag Technique and the Monetary Versus Fiscal
Policy Debate," Journal of the American Statistical Association, March 1973, pp. 11-19.

¹⁴ This is demonstrated with an empirical example in J. J. Thomas, "Some Problems in the Use
of Almon's Technique in the Estimation of Distributed Lags," Empirical Economics, Vol. 2,
1977, pp. 175-193.

Figure 10.3. Long-tailed lag distribution.

Apart from the problem of endpoint restrictions, there are three other
problems with polynomial lags:

1. Problems of long-tailed distributions. It is difficult to capture long-tailed 
lag distributions like the one shown in Figure 10.3. This problem can be 
solved by using a piecewise polynomial. Another procedure is to have a 
polynomial for the initial p, and a Koyck or geometric lag for the latter 
part. 

2. Problem of choosing the lag length k. Schmidt and Waud suggest choos¬ 
ing k on the basis of maximum R 2 : Frost l ‘ i did a simulation experiment 
using this criterion and found that a substantial upward bias in the lag 
length occurs. As we discussed in Chapter 4, the R 2 criterion implies that 
a regressor is retained if the F-ratio is > 1. The bias that Frost suggests 
can be corrected by using F-ratios greater than 1, says F = 2. 

3. Problem of choosing r, the degree of the polynomial. If the lag length k 
is correctly specified, then choosing the degree of the polynomial r is 
straightforward. We start with a sufficiently high-degree polynomial 
as in (10.27), construct $z_{0t}, z_{1t}, z_{2t}, z_{3t}, \ldots$ as defined in (10.27), 
and drop the higher-order terms sequentially. Note that the proper 
way of testing is to start with the highest-degree polynomial possible and 
then go backward, until one of the hypotheses is rejected. This sequential 
test was suggested by Anderson, 16 who showed that the tests in the resulting 
sequence are independent and that the significance level for the jth 
test is 

$$1 - \prod_{i=1}^{j} (1 - \gamma_i)$$

where $\gamma_i$ is the significance level chosen at the ith step in the sequence. If 
$\gamma_i = 0.05$ for all i, the significance levels for four successive tests are 0.05, 
0.0975, 0.1426, and 0.1855, respectively. Godfrey and Poskitt 17 use this 
procedure suggested by Anderson to choose the degree of the polynomial 
for the Almon lag. 

15 P. A. Frost, “Some Properties of the Almon Lag Technique When One Searches for Degree 
of Polynomial and Lag,” Journal of the American Statistical Association, Vol. 70, 1975, pp. 
606-612. 

16 T. W. Anderson, “The Choice of the Degree of a Polynomial Regression as a Multiple Decision 
Problem,” Annals of Mathematical Statistics, Vol. 33, No. 1, 1966, pp. 255-265. This test is 
also discussed in T. W. Anderson, The Statistical Analysis of Time Series (New York: Wiley, 
1971), pp. 34-43. 
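
The cumulative significance levels quoted for the sequential test can be checked with a few lines of code. The sketch below is only an illustration; the function name `cumulative_significance` is ours and does not come from any package.

```python
def cumulative_significance(gammas):
    """Overall significance level after each step of a sequence of independent tests.

    gammas[i] is the significance level chosen at step i + 1.
    """
    levels = []
    keep = 1.0  # running product of (1 - gamma_i)
    for g in gammas:
        keep *= (1.0 - g)
        levels.append(1.0 - keep)
    return levels


# Four successive tests, each carried out at the 5% level.
levels = cumulative_significance([0.05] * 4)
print([round(v, 4) for v in levels])  # [0.05, 0.0975, 0.1426, 0.1855]
```

The point of the calculation is that the later tests in the sequence are carried out at an effective level well above the nominal 5%.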

We will now illustrate this procedure with an example. 

Illustrative Example 

Consider the data in Table 10.3. The data are on capital expenditures (Y) and 
appropriations (X) for the years 1953-1967 on a quarterly basis. These data 
are from the National Industrial Conference Board. Since they have been used 
very often (in fact, too often) in the illustration of different distributed lag 
models, they are being reproduced here. It is a good data set for some beginning 
exercises in distributed lag estimation. 

Before we impose any functional forms on the data, it would be interesting 
to see what some OLS estimates of unrestricted distributed lag equations look 
like. There are a total of 60 observations and it was decided to estimate 12 
lagged coefficients. The implicit assumption is that the maximum time lag 
between appropriations and expenditures is 3 years. To make the estimated 
$\hat{\sigma}^2$ comparable, the equations were all estimated using the last 48 
observations of the dependent variable Y. The results are presented in Table 10.4. 
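
The bookkeeping behind "the last 48 observations" can be sketched as follows. This is a minimal illustration, not the program used for Table 10.4; the helper `lagged_rows` and the placeholder series are hypothetical. Fixing the first usable observation by the longest lag considered (12) keeps the sample the same whatever number of lags is actually included.

```python
def lagged_rows(x, n_lags, max_lag):
    """Rows [x_t, x_{t-1}, ..., x_{t-n_lags}] for t = max_lag, ..., len(x) - 1.

    Starting every equation at t = max_lag (0-based) keeps the estimation
    sample identical across equations with different numbers of lags.
    """
    return [[x[t - i] for i in range(n_lags + 1)] for t in range(max_lag, len(x))]


x = list(range(1, 61))            # placeholder series with 60 observations
rows_12 = lagged_rows(x, 12, 12)  # equation with lags 0..12
rows_4 = lagged_rows(x, 4, 12)    # equation with lags 0..4, same sample
print(len(rows_12), len(rows_12[0]))  # 48 13
print(len(rows_4), len(rows_4[0]))    # 48 5
```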

There are several features of the lag distributions in Table 10.4 that are 
interesting. The $\bar{R}^2$ steadily increases until we use seven lags. The sum of the 
coefficients also increases steadily. Overall, from the results presented there, it 
appears that a lag distribution using seven lags is appropriate. This corresponds 
to a maximum lag of 1.75 years between capital appropriations and expenditures. 
Further, about 98.6% of all appropriations are eventually spent. 

One disturbing feature of the OLS estimates is that no matter how many lags 
we include, the DW test statistic shows significant positive serial correlation. 
For instance, the following are the DW test statistics for different lengths of the 
lag distribution. 


Length of Lag      DW 

      0           0.38 
      4           0.58 
      8           0.35 
     12           0.44 


17 L. G. Godfrey and D. S. Poskitt, “Testing the Restrictions of the Almon Lag Technique,” 
Journal of the American Statistical Association, Vol. 70, March 1975, pp. 105-108. 
Table 10.3 Capital Expenditures (Y) and Appropriations (X) for 1953-1967 
(Quarterly Data) 

  N      Y      X        N      Y      X        N      Y      X 

  1    2072   1660      21    2697   1511      41    2601   2629 
  2    2077   1926      22     [?]    [?]      42    2648   3133 
  3    2078   2181      23    2140   1990      43     [?]   3449 
  4    2043   1897      24    2012   1993      44    2937   3764 
  5    2062   1695      25    2071   2520      45    3136   3983 
  6    2067   1705      26    2192   2804      46     [?]   4381 
  7    1964   1731      27    2240   2919      47    3514   4786 
  8    1981   2151      28    2421   3024      48    3815   4094 
  9    1914   2556      29    2639   2725      49     [?]   4870 
 10    1991   3152      30    2733   2321      50     [?]   5344 
 11    2129   3763      31    2721   2131      51    4531   5433 
 12    2309   3903      32    2640   2552      52    4825   5911 
 13    2614   3912      33    2513   2234      53     [?]   6109 
 14    2896   3571      34    2448   2282      54    5319   6542 
 15    3058   3199      35    2429   2533      55    5574   5785 
 16    3309   3262      36    2516   2517      56    5749   5707 
 17    3446   3476      37    2534   2772      57    5715   5412 
 18    3466   2993      38    2494   2380      58    5637   5465 
 19    3435   2262      39    2596   2568      59    5383   5550 
 20    3183   2011      40    2572   2944      60    5467   5465 


This suggests that we should be estimating some more general dynamic 
models, allowing for autocorrelated errors. 
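
For readers who want to compute the DW statistic themselves, a minimal sketch is given below. The residual series are made up purely to show how the statistic behaves; they are not residuals from the equations reported here.

```python
def durbin_watson(residuals):
    """DW = sum_{t>=2} (e_t - e_{t-1})^2 / sum_t e_t^2."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


dw_smooth = durbin_watson([1.0, 2.0, 3.0, 2.0, 1.0])          # slowly moving residuals
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])        # sign changes every period
print(round(dw_smooth, 4), dw_alternating)  # 0.2105 3.0
```

Values well below 2, like those in the table above, are what slowly moving (positively autocorrelated) residuals produce.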

Choosing the Degree of the Polynomial 

Consider the data in Table 10.3. We will consider a lag length of 12 and consider 
the choice of the degree of the polynomial by the sequential testing method 
outlined here. We start with a fourth-degree polynomial. 

The results are as follows (figures in parentheses are t-ratios, not standard 
errors): 


Coefficient of:               Equation 1    Equation 2    Equation 3    Equation 4 

z_{0t}                          0.1150        0.1108        0.1468        0.1637 
                                (5.78)        (5.80)       (14.20)       (33.06) 

10^{-1} x z_{1t}                0.2003        0.3305       -0.0507       -0.1463 
                                (0.82)        (1.83)       (-0.97)      (-16.70) 

10^{-2} x z_{2t}               -0.2681       -0.8549       -0.0824 
                               (-0.33)       (-2.42)       (-1.86) 

10^{-3} x z_{3t}               -0.4517        0.4011 
                               (-0.41)        (2.20) 

10^{-4} x z_{4t}                0.3951 
                                (0.79) 

\hat{\sigma}^2                  15,520        15,389        16,702        17,590 
DW                               0.514         0.465         0.518         0.536 
R^2                             0.9988        0.9988        0.9987        0.9986 


First we test the coefficient of $z_{4t}$ at the 5% level and do not reject the 
hypothesis that it is zero. Next we test the coefficient of $z_{3t}$ and we reject the 
hypothesis that it is zero. Since this is the first hypothesis rejected, 
we use a polynomial of the third degree. The results for the lower-degree 
polynomials are not needed and are presented only for the sake of curiosity. 
Actually, for the fourth-degree polynomial, the high $R^2$ and the low t-ratios 
indicate that there is high multicollinearity among the variables. 
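
The backward search just described can be written out as a short routine. This is only a sketch: the function `choose_degree` is hypothetical, the t-ratios are those reported for the highest-order term in each equation, and the critical value 2.02 is our rough approximation to the two-sided 5% point for the degrees of freedom available here.

```python
def choose_degree(t_top, critical_value):
    """Backward sequential choice of the polynomial degree.

    t_top maps each degree to the t-ratio of the highest-order term in the
    polynomial of that degree. Starting from the highest degree, stop at the
    first degree whose top term is significant.
    """
    for degree in sorted(t_top, reverse=True):
        if abs(t_top[degree]) > critical_value:
            return degree
    return 0  # no term beyond the constant is significant


# t-ratios of the highest-order term in equations 1 to 4 (degrees 4 to 1).
t_top = {4: 0.79, 3: 2.20, 2: -1.86, 1: -16.70}
chosen = choose_degree(t_top, critical_value=2.02)
print(chosen)  # 3
```

Note that the routine never looks at the second-degree equation once the third-degree term is found significant, which is the sense in which the lower-degree results "are not needed."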

We now estimate the coefficients of the lag distribution using the formula 

$$\hat{\beta}_i = 0.1108 + 0.3305 \times 10^{-1}\, i - 0.8549 \times 10^{-2}\, i^2 + 0.4011 \times 10^{-3}\, i^3$$

The coefficients are: 


Lag   Coefficient      Lag   Coefficient      Lag   Coefficient 

 0      0.1108          5      0.1125          9      0.0082 
 1      0.1357          6      0.0880         10     -0.0125 
 2      0.1459          7      0.0608         11     -0.0262 
 3      0.1438          8      0.0334         12     -0.0306 
 4      0.1319 
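
The lag coefficients can be reproduced directly from the estimated third-degree polynomial. The sketch below uses the four polynomial coefficients of equation 2; the helper name `lag_coefficient` is ours.

```python
# Polynomial coefficients of the third-degree equation (equation 2),
# with the powers of ten already applied.
a = [0.1108, 0.3305e-1, -0.8549e-2, 0.4011e-3]


def lag_coefficient(i, coeffs):
    """beta_i = a_0 + a_1 * i + a_2 * i**2 + ... for lag i."""
    return sum(a_j * i ** j for j, a_j in enumerate(coeffs))


betas = [lag_coefficient(i, a) for i in range(13)]  # lags 0 through 12
for i, b in enumerate(betas):
    print(i, round(b, 4))
print("sum of coefficients:", round(sum(betas), 4))  # 0.9017
```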


The sum of the coefficients is 0.9017, and furthermore, the last three 
coefficients are negative. This is the reason that, as mentioned earlier, some endpoint 
restrictions are often imposed in estimating the polynomial distributed lags. 
Comparing these results with the unrestricted OLS estimates in Table 10.4, it 
appears that not much is to be gained (in this particular example) by using the 
polynomial lag. 

In any case, the example illustrates the procedure of choice of the degree of 
the polynomial for given lag length. 
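
For completeness, here is a sketch of how the transformed regressors used in such polynomial lag regressions can be built. We assume the usual Almon construction $z_{jt} = \sum_{i=0}^{k} i^j x_{t-i}$ for definition (10.27), which is not reproduced in this section; the function name `almon_z` and the tiny series are illustrative only.

```python
def almon_z(x, k, r):
    """Almon variables z_{jt} = sum_{i=0}^{k} i**j * x_{t-i}, for j = 0, ..., r.

    Returns one row [z_0t, z_1t, ..., z_rt] for each t with a full lag history.
    (Assumed form of the construction; note that 0**0 evaluates to 1 in Python.)
    """
    rows = []
    for t in range(k, len(x)):
        rows.append([sum((i ** j) * x[t - i] for i in range(k + 1)) for j in range(r + 1)])
    return rows


z = almon_z([1, 2, 3, 4], k=2, r=2)
print(z)  # [[6, 4, 6], [9, 7, 11]]
```

Regressing y on these columns gives the polynomial coefficients, from which the individual lag coefficients follow.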


10.9 Rational Lags 


In Section 10.8 we discussed finite lag distributions. We will now discuss 
infinite lag distributions. Actually, we considered one earlier: the geometric or 
Koyck lag. A straightforward generalization of this is the rational lag 
distribution. 18 


18 D. W. Jorgenson, “Rational Distributed Lag Functions,” Econometrica, Vol. 34, January 
1966, pp. 135-149. 
x_{t-3}                0.131 (1.1)    0.648 (5.9) 

x_{t-4}                0.476 (5.8) 

Sum of coefficients    0.965     0.955     0.942     0.929     0.915 

\bar{R}^2              0.9965    0.9941    0.9896    0.9829    0.9742 
Consider equation (10.11). One way of generalizing it is to add more lags on 
the left-hand side and right-hand side of the equation, and have the parameters 
different. If p is the number of lags on the right-hand side and q the number of 
lags on the left-hand side, we have a rational distributed lag model, which we 
denote by RDL(p, q). For instance, the RDL(3, 2) model corresponding to 
(10.11) is 

$$x_t^* + \beta_1 x_{t-1}^* + \beta_2 x_{t-2}^* = \alpha_1 x_{t-1} + \alpha_2 x_{t-2} + \alpha_3 x_{t-3}$$

A similar generalization can be given for the partial adjustment model (10.16). 
We have 

$$y_t + \beta_1 y_{t-1} + \beta_2 y_{t-2} = \alpha_0 y_t^* + \alpha_1 y_{t-1}^* + \alpha_2 y_{t-2}^* + \alpha_3 y_{t-3}^*$$

As before, we have to somehow eliminate the unobserved variables $x_t^*$ and $y_t^*$. 
Since the algebraic details are similar to those in Section 10.7, we will not 
pursue them here. 19 Lucas and Rapping 20 use the RDL model for price 
expectations, instead of the adaptive expectations model, but they estimate it in the 
autoregressive form ignoring serial correlation. 

The rational distributed lag model and its use in expectations should not be 
confused with the rational expectations models we discuss in the next section. 


10.10 Rational Expectations 


In Sections 10.2 and 10.3 we discussed some naive models of expectations and 
the adaptive expectations model. We also saw how the latter can be generalized 
using some polynomial lags. During the last decade a new theory, that of 
“rational expectations,” has taken a strong hold in almost all econometric work 
on expectations. 

The basic idea of “rational expectations” comes from a pathbreaking paper 
by John Muth, 21 who observed that the various expectational formulas that 
were used in the analysis of dynamic economic models had little resemblance 
to the way the economy works. If the economic system changes, the way 
expectations are formed should change, but the traditional models of expectations 
do not permit any such changes. The adaptive expectations formula, for 
instance, says that economic agents revise their expectations upward or 
downward based on the most recent error. The formula says 


19 The estimation of the RDL model in the distributed lag form is discussed in G. S. Maddala, 
Econometrics (New York: McGraw-Hill, 1977), pp. 366-367. 

20 R. E. Lucas, Jr., and L. A. Rapping, “Price Expectations and the Phillips Curve,” The 
American Economic Review, Vol. 59, 1969, pp. 342-350. They warn that this should not be confused 
with the “rational expectations” model as defined by Muth (which we discuss in the next 
section). 

21 John F. Muth, “Rational Expectations and the Theory of Price Movements,” Econometrica, 
Vol. 29, 1961, pp. 315-335. 
$$y_t^* - y_{t-1}^* = \lambda (y_{t-1} - y_{t-1}^*) \qquad 0 < \lambda < 1$$

where $y_t^*$ is the expectation for $y_t$ as formed at time t - 1. In this formula $\lambda$ is 
a constant. Of course, one can modify this formula so that $\lambda$ depends on 
whatever variables produce changes in the economic system. 
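
A small numerical sketch may help fix ideas about the adaptive expectations formula. The series, the value of $\lambda$, and the starting expectation below are arbitrary choices made for illustration.

```python
def adaptive_expectations(y, lam, initial):
    """Expectations generated by y*_t = y*_{t-1} + lam * (y_{t-1} - y*_{t-1}).

    `initial` is the (assumed) expectation held for the first period.
    """
    expectations = [initial]
    for t in range(1, len(y)):
        previous = expectations[-1]
        expectations.append(previous + lam * (y[t - 1] - previous))
    return expectations


# The variable stays at 10; agents start out expecting 0 and close half
# of the most recent error each period.
y_star = adaptive_expectations([10.0, 10.0, 10.0, 10.0], lam=0.5, initial=0.0)
print(y_star)  # [0.0, 5.0, 7.5, 8.75]
```

With a constant $\lambda$ the same revision rule is applied whatever happens to the economic system, which is the feature Muth objected to.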

There are some reasonable requirements that $y_t^*$, the predicted value of $y_t$, 
should satisfy. Consider the prediction error 

$$\varepsilon_t = y_t - y_t^*$$

It is reasonable to require that expectations be unbiased, that is, 

$$E(\varepsilon_t) = 0$$

Otherwise, there is a systematic component in the forecast error which 
forecasters should be able to correct. 

Muth also required that the prediction error be uncorrelated with the entire 
information set that is available to the forecaster at the time the prediction is 
made. If we denote by $I_{t-1}$ the information available at time (t - 1) and write 

$$y_t = y_t^* + \varepsilon_t \tag{10.30}$$

then $y_t^*$ depends on $I_{t-1}$ and $y_t^*$ is uncorrelated with $\varepsilon_t$. One implication of this 
equation is that $\mathrm{var}(y_t) = \mathrm{var}(y_t^*) + \mathrm{var}(\varepsilon_t)$ and hence $\mathrm{var}(y_t) \ge \mathrm{var}(y_t^*)$. If the 
prediction error is correlated with any variables in $I_{t-1}$, it implies that the 
forecaster has not used all the available information. Taking (mathematical) 
expectations of all variables in (10.30), conditional on $I_{t-1}$, we get 

$$y_t^* = E(y_t \mid I_{t-1}) \tag{10.31}$$

The left-hand side of (10.31) should be interpreted as the subjective 
expectation and the right-hand side of (10.31) as the objective expectation conditional 
on data available when the expectation was formed. Thus there is a connection 
between the subjective beliefs of economic agents and the actual behavior of 
the economic system. 

Formula (10.31) forms the basis of almost all the econometric work on 
rational expectations. There are three assumptions involved in the use of formula 
(10.31). 

1. There exists a unique mathematical expectation of the random variable 
$y_t$ based on the given set of information $I_{t-1}$. 

2. Economic agents behave as if they know this conditional expectation and 
equate their own subjective expectation of $y_t$ to this conditional 
expectation. Note that this implies that the economic agents behave as if they 
have full knowledge about the model that the econometrician is estimating, 
that is, they behave as if they know not only the structure of the 
model but the parameters as well. 

3. The econometrician, however, does not know the parameters of the 
model but has to make inferences about them based on assumption 2 
about the behavior of economic agents and the resulting stochastic 
behavior of the system. 
There have been many criticisms of the rational expectations hypothesis as 
implied by equation (10.31). The basic argument is that “rational” economic 
agents need not behave this way. We will not go through these criticisms here. 
The whole controversy surrounding the rational expectations hypothesis could 
have been avoided if the term “rational” had not been used to describe the 
mechanism of expectation formation given by (10.31). A more appropriate term to 
characterize (10.31) is model consistent, because the expression for $y_t^*$ derived 
from (10.31) depends on the particular model we start with. The coinage of the 
term “rational” is rather unfortunate. We will, however, continue to use the 
word “rational” to describe “model consistent” or Muthian expectations 
formed using (10.31). 

The requirement that $y_t^*$ should completely summarize the information in $I_{t-1}$ 
has led Lovell 22 to suggest the term “sufficient expectations” because it is 
related to the statistical concept of a “sufficient estimator,” which may be loosely 
defined as an estimator that utilizes all the information available in the sample. 

There are, basically, two methods of estimation for rational expectations 
models. The first method involves using the definition (10.31) and writing 

$$y_t^* = y_t - v_t$$

where $v_t$ is uncorrelated with all the variables in the information set $I_{t-1}$. We 
then estimate this model by the instrumental variable method, using the 
appropriate instruments. This method is very easy to apply and gives consistent, 
though not efficient, estimators. We will illustrate this using the 
hyperinflation model discussed in Section 10.5. 

The second method uses information on the structure of the model to derive 
an explicit expression for $y_t^*$. This method involves the following steps: 

1. Derive the equation for $y_t$ from the model you start with. 

2. Take expectations of $y_t$ conditional on $I_{t-1}$ and substitute the resulting 
expression for $y_t^*$ in the model. 

3. Reestimate the model with this substitution. 

We will illustrate this with a simple demand and supply model in Section 
10.12. We will also present the results from the instrumental variable method 
for purposes of comparison. 

22 M. C. Lovell, “Tests of the Rational Expectations Hypothesis,” The American Economic 
Review, Vol. 76, March 1986, pp. 110-124. 

The instrumental variable method, although not efficient, gives consistent 
estimators of the parameters in all rational expectations models and is worth 
using at least as an initial step. Consider, for instance, Cagan's model of 
hyperinflation. The model is 

$$m_t - p_t = a + b(p_{t+1}^* - p_t) + u_t$$

Now substitute $p_{t+1} - v_{t+1}$ for $p_{t+1}^*$, where $v_{t+1}$ is uncorrelated with all 
variables in the information set $I_t$. We can write the model as 

$$m_t - p_t = a + b(p_{t+1} - p_t) + u_t - b v_{t+1}$$

This equation cannot be estimated by OLS because $p_{t+1}$ is correlated with $v_{t+1}$ 
and $p_t$ with $u_t$. If $m_t$ and $p_t$ are known at time t, then $v_{t+1}$ will be uncorrelated 
with these variables. One needs an instrument that is uncorrelated with $u_t$ and 
$v_{t+1}$. Valid instruments are $m_{t-1}$, $p_{t-1}$, and higher-order lags of $m_t$ and $p_t$. One 
can regress $p_{t+1} - p_t$ on lagged values of $m_t$ and $p_t$ and use the 2SLS method. 

The estimation of the hyperinflation model under rational expectations by 
the above-mentioned instrumental variable method, using the data in Tables 
10.1 and 10.2, is left as an exercise. 
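
The mechanics of the instrumental variable estimator can be seen in a deliberately artificial example. The numbers below are constructed by us, not taken from Tables 10.1 and 10.2: the regressor x plays the role of $p_{t+1} - p_t$, which contains the forecast error v; the composite error is $-bv$ (we set $u_t = 0$ and $a = 0$ to keep the arithmetic transparent); and z is a single instrument that is exactly uncorrelated with v. With one instrument, the simple IV estimator coincides with 2SLS.

```python
def slope(x, y, z=None):
    """Slope of y on x by OLS (z is None) or by IV with instrument z.

    IV slope = sum((z - zbar)(y - ybar)) / sum((z - zbar)(x - xbar)).
    """
    if z is None:
        z = x
    n = len(x)
    xbar, ybar, zbar = sum(x) / n, sum(y) / n, sum(z) / n
    num = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
    den = sum((zi - zbar) * (xi - xbar) for zi, xi in zip(z, x))
    return num / den


b_true = -0.5
z = [-1.0, 1.0, -1.0, 1.0]   # instrument (say, a lagged variable)
v = [1.0, 1.0, -1.0, -1.0]   # forecast error, orthogonal to z by construction
x = [2.0 * zi + vi for zi, vi in zip(z, v)]              # regressor contaminated by v
y = [b_true * xi - b_true * vi for xi, vi in zip(x, v)]  # composite error is -b*v

b_ols = slope(x, y)
b_iv = slope(x, y, z)
print(b_ols, b_iv)  # OLS is pulled toward zero; IV recovers -0.5
```

In this constructed case OLS gives -0.4 while the IV estimate equals the true value -0.5, the same direction of bias as in an errors-in-variables problem.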


10.11 Tests for Rationality 


There is a considerable amount of literature on what is known as “tests for 
rationality.” In this literature we do not start with any economic model. 
Usually, we have observations on $y_t^*$ from survey data or other sources and we test 
whether the forecast error $y_t - y_t^*$ is uncorrelated with variables in the 
information set $I_{t-1}$. It is customary to start with a test of unbiasedness by 
estimating the regression equation 

$$y_t = \beta_0 + \beta_1 y_t^* + \varepsilon_t$$

and testing the hypothesis $\beta_0 = 0$, $\beta_1 = 1$. 

Further, since $y_{t-1}$ is definitely in the information set $I_{t-1}$, the following 
equation is estimated: 

$$y_t - y_t^* = \alpha_0 + \alpha_1 y_{t-1} + \varepsilon_t$$

and the hypothesis $\alpha_0 = 0$, $\alpha_1 = 0$ is tested. Rationality implies that $\alpha_1 = 0$. 
An alternative equation that can be estimated is 

$$y_t - y_t^* = \alpha_0 + \alpha_1 (y_{t-1} - y_{t-1}^*) + \varepsilon_t$$

Again, rationality implies that $\alpha_1 = 0$. If the forecast errors exhibit a significant 
nonzero mean and serial correlation (significant $\alpha_1$), then this implies that the 
information contained in past forecast errors was not fully utilized in forming 
future predictions. 

Tests based on $y_{t-1}$ and $(y_{t-1} - y_{t-1}^*)$ examine what is known as the weak 
version of the rational expectations hypothesis and are tests for weak 
rationality. The strong version says that the forecast error $(y_t - y_t^*)$ is uncorrelated 
with all the variables known to the forecaster. 
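
The unbiasedness test amounts to an F-test of two restrictions in a simple regression. The sketch below computes the statistic for a made-up set of four forecasts and outcomes; the function names are ours, and in practice the statistic would be compared with the tabulated F value with 2 and n - 2 degrees of freedom.

```python
def ols_line(x, y):
    """Intercept and slope of the least squares regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def unbiasedness_test(actual, forecast):
    """F statistic for H0: beta0 = 0, beta1 = 1 in actual = beta0 + beta1 * forecast + e."""
    n = len(actual)
    b0, b1 = ols_line(forecast, actual)
    rss_u = sum((a - b0 - b1 * f) ** 2 for a, f in zip(actual, forecast))
    rss_r = sum((a - f) ** 2 for a, f in zip(actual, forecast))  # imposes b0 = 0, b1 = 1
    f_stat = ((rss_r - rss_u) / 2) / (rss_u / (n - 2))
    return b0, b1, f_stat


forecast = [1.0, 2.0, 3.0, 4.0]   # hypothetical survey expectations y*
actual = [1.1, 1.9, 3.1, 3.9]     # hypothetical realized values y
b0, b1, f_stat = unbiasedness_test(actual, forecast)
print(round(b0, 4), round(b1, 4), round(f_stat, 4))  # 0.1 0.96 0.25
```

An F statistic this small would not lead to rejection of unbiasedness.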

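Yet another test for rationality is that $\mathrm{var}(y_t) > \mathrm{var}(y_t^*)$. As we saw earlier, if 

$$y_t = y_t^* + \varepsilon_t$$

and $\varepsilon_t$ is uncorrelated with $y_t^*$, then $\mathrm{var}(y_t) = \mathrm{var}(y_t^*) + \mathrm{var}(\varepsilon_t)$. Hence 
$\mathrm{var}(y_t) \ge \mathrm{var}(y_t^*)$. 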
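
The variance decomposition behind this test is easy to verify numerically. The two series below are constructed by us so that the forecast and the forecast error are exactly uncorrelated in the sample; they carry no empirical content.

```python
def variance(values):
    """Population variance (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


forecast = [1.0, -1.0, 1.0, -1.0]   # y*_t
error = [1.0, 1.0, -1.0, -1.0]      # forecast error, uncorrelated with y*_t
actual = [f + e for f, e in zip(forecast, error)]  # y_t = y*_t + error

print(variance(actual), variance(forecast), variance(error))  # 2.0 1.0 1.0
```

When the forecast error is correlated with the forecast, the covariance term no longer drops out and the inequality can fail, which is what the test exploits.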

Lovell 23 considers these two tests: 

1. Tests based on $y_{t-1}$. 

2. Tests based on $\mathrm{var}(y_t) > \mathrm{var}(y_t^*)$. 

23 Lovell, “Tests.” 

He examines the evidence from a number of surveys on sales and inventory 
expectations, price expectations, wage expectations, data revisions, and so on, 
and argues that in a majority of cases the tests for rationality reject the 
hypothesis of rationality. 

Some studies that test for rationality of survey forecasts, like those by 
Pesando, Carlson, and Mullineaux for price expectations 24 and tests by Friedman 
for interest-rate expectations, 25 use a different procedure. What they do is the 
following: 

1. Regress $y_t$ on the variables in the information set $I_{t-1}$. 

2. Regress $y_t^*$ on the same variables in the information set $I_{t-1}$. 

3. Test the equality of coefficients in the two regressions using the Chow 
test described in Chapter 4. 

However, note that the variances of the error terms in the two equations are 
likely to be unequal, since $\mathrm{var}(y_t)$ is expected to be greater than $\mathrm{var}(y_t^*)$. Hence 
some adjustment should be made for this before applying the Chow test. 
(Divide each data set by the estimated standard deviation of the error term.) 

This test is formally identical to the test suggested earlier of regressing the 
forecast error $y_t - y_t^*$ on the variables in $I_{t-1}$ and checking that their 
coefficients are zero. Without any loss of generality, suppose that there are two 
variables $z_1$ and $z_2$ in the information set $I_{t-1}$ ($z_1$ and $z_2$ can be sets of variables). 
We can write 

$$y = \beta_1 z_1 + \beta_2 z_2 + u$$
$$y^* = \beta_1^* z_1 + \beta_2^* z_2 + u^*$$

Then 

$$(y - y^*) = (\beta_1 - \beta_1^*) z_1 + (\beta_2 - \beta_2^*) z_2 + (u - u^*) \tag{10.32}$$

Thus a regression of $(y - y^*)$ on $z_1$ and $z_2$ and testing whether the coefficients 
of $z_1$ and $z_2$ are significantly different from zero is the same as testing the 
equality of the coefficients of $z_1$ and $z_2$ in the regressions of $y$ and $y^*$ on these 
variables. There is something to be said in favor of the test based on (10.32) 
instead of the one based on separate regressions for $y$ and $y^*$. With the test 
based on (10.32), the fact that $\mathrm{var}(u)$ and $\mathrm{var}(u^*)$ are different does not matter. 
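
The algebra of (10.32) holds exactly for the least squares estimates as well, because OLS coefficients are linear in the dependent variable. The following sketch checks this with four made-up observations; the two-regressor solver `ols_two` is written out by hand and is not a library routine.

```python
def ols_two(z1, z2, y):
    """OLS coefficients of y on z1 and z2 (no intercept) from the 2x2 normal equations."""
    s11 = sum(a * a for a in z1)
    s22 = sum(b * b for b in z2)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det)


z1 = [1.0, 0.0, 1.0, 2.0]
z2 = [0.0, 1.0, 1.0, 1.0]
y = [1.0, 2.0, 2.5, 4.5]        # hypothetical realizations
y_star = [0.5, 2.0, 3.0, 4.0]   # hypothetical expectations

coef_y = ols_two(z1, z2, y)
coef_star = ols_two(z1, z2, y_star)
coef_diff = ols_two(z1, z2, [a - b for a, b in zip(y, y_star)])

# Coefficients from regressing (y - y*) equal the differences of the
# coefficients from the two separate regressions.
print(coef_diff, (coef_y[0] - coef_star[0], coef_y[1] - coef_star[1]))
```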

What happens if in the tests for rationality we use only a subset of the 
variables in the information set $I_{t-1}$? For instance, in equation (10.32) we just 
regress $y - y^*$ on $z_1$ only and test whether the coefficient of $z_1$ is significantly 
different from zero. Recall that this is what we do in tests for weak rationality, 
where we regress $y_t - y_t^*$ on $y_{t-1}$. 

24 J. E. Pesando, “A Note on the Rationality of the Livingston Price Expectations,” Journal of 
Political Economy, Vol. 83, August 1975, pp. 849-858; J. A. Carlson, “A Study of Price 
Forecasts,” Annals of Economic and Social Measurement, Vol. 6, Winter 1977, pp. 27-56; D. J. 
Mullineaux, “On Testing for Rationality: Another Look at the Livingston Price Expectations 
Data,” Journal of Political Economy, Vol. 86, April 1978, pp. 329-336. 

25 B. M. Friedman, “Survey Evidence on the ‘Rationality’ of Interest Rate Expectations,” 
Journal of Monetary Economics, Vol. 6, October 1980, pp. 453-466. 

Suppose that $(y - y^*)$ is uncorrelated with $z_1$ but not with $z_2$, so that in 
equation (10.32), $\beta_2 - \beta_2^* \ne 0$. This implies that the expectations $y^*$ are not 
rational. Now suppose that $z_1$ and $z_2$ are uncorrelated. We can still get the result 
that the coefficient of $z_1$ in a regression of $y - y^*$ on $z_1$ is zero. Thus if we do 
not reject the hypothesis that this coefficient is zero, it does not necessarily 
mean that we can accept the hypothesis of rationality. 26 This implies that tests 
of weak rationality are indeed weak. However, in practice, it is highly unlikely 
that the variables in the information set $I_{t-1}$ are uncorrelated. Hence one can 
have sufficient confidence in the weak tests and one need not worry that not all the 
variables in the information set $I_{t-1}$ are included in the tests. 

10.12 Estimation of a Demand and Supply 
Model Under Rational Expectations 


We will illustrate the estimation of econometric models with rational 
expectations using equations (10.30) and (10.31) by considering a simple demand and 
supply model. Consider the following model (all variables considered as 
deviations from their means): 

$$q_t = \beta_1 p_t + \gamma_1 z_{1t} + u_{1t} \qquad \text{demand function} \tag{10.33}$$

$$q_t = \beta_2 p_t^* + \gamma_2 z_{2t} + u_{2t} \qquad \text{supply function} \tag{10.34}$$

where  $q_t$ = quantity demanded or supplied (both are the same by the 
             assumption of equilibrium) 
       $p_t$ = market price 
       $p_t^*$ = market price at time t as expected at time (t - 1); 
             we assume that there is a one-period lag between production 
             decisions and market supply 
       $z_{1t}, z_{2t}$ = exogenous variables 
       $u_{1t}, u_{2t}$ = serially uncorrelated disturbances with the usual properties 
             (zero mean, constant variances, and constant covariance) 

We consider two cases: $z_{1t}$ and $z_{2t}$ known, and unknown, at time (t - 1). 

26 A. Abel and F. S. Mishkin, “An Integrated View of Tests of Rationality, Market Efficiency 
and Short-Run Neutrality of Monetary Policy,” Journal of Monetary Economics, Vol. 11, 1983, 
pp. 3-24. 

Case 1 

$z_{1t}$ and $z_{2t}$ are known at time (t - 1), that is, they are in the information set $I_{t-1}$. 
In this case the estimation of the model is very simple. The rational 
expectations hypothesis implies that 

$$p_t = p_t^* + \varepsilon_t$$

where $E(\varepsilon_t) = 0$ and $\mathrm{cov}(\varepsilon_t, z_{1t}) = \mathrm{cov}(\varepsilon_t, z_{2t}) = 0$. Thus $\varepsilon_t$ has the same 
stochastic properties as $u_{1t}$ and $u_{2t}$. Now substituting $p_t^* = p_t - \varepsilon_t$ in (10.34) we 
get 

$$q_t = \beta_2 p_t + \gamma_2 z_{2t} + (u_{2t} - \beta_2 \varepsilon_t) \tag{10.34'}$$

The composite error in this equation has the same properties as the error term 
in (10.34) (zero mean, and zero correlation with $z_{1t}$ and $z_{2t}$). 

Thus the implication of the “rational” expectations hypothesis is that all we 
have to do is to substitute $p_t$ for $p_t^*$ and proceed with the estimation as usual. 
Thus the rational expectations hypothesis greatly simplifies the estimation 
procedures. This conclusion holds good in any simultaneous equations model 
involving expectations of current endogenous variables if the current exogenous 
variables are assumed to be known at time (t - 1) (or at the time expectations 
are formed), and the errors are serially independent. 

There is, however, one problem that the “rational” expectations hypothesis 
creates. Suppose that the variable $z_{2t}$ is missing in equation (10.34). Then with 
any exogenous specification of the expectational variable $p_t^*$ (as in the adaptive 
expectations formulation), the equation system would be identified. On the 
other hand, under the “rational” expectations hypothesis the demand function 
is not identified. Thus, in the general simultaneous-equations model with 
expectations of current endogenous variables, the identification properties will 
depend on the identification properties of the simultaneous equations model 
resulting from the substitution of $y_t$ for $y_t^*$, where $y_t^*$ is the expectation of the 
endogenous variable $y_t$. This result makes intuitive sense since what the 
rational expectations hypothesis does is to make the expectations endogenous. 


Case 2 

$z_{1t}$ and $z_{2t}$ are not known at time (t - 1). That is, the exogenous variables at 
time t have to be forecast at time (t - 1). What will happen if we just substitute 
$p_t^* = p_t - \varepsilon_t$ as before? Since $\varepsilon_t$ is not uncorrelated with $z_{1t}$ and $z_{2t}$ [they are 
not in the information set at time (t - 1)], the composite error term in (10.34') 
will be correlated with $z_{1t}$ and $z_{2t}$. 

How do we estimate this model, then? What we have to do is to add 
equations for $z_{1t}$ and $z_{2t}$. A common procedure is to specify them as autoregressions. 
For simplicity we will specify them as first-order autoregressions. Then we 
have 

$$z_{1t} = \alpha_1 z_{1,t-1} + v_{1t}$$
$$z_{2t} = \alpha_2 z_{2,t-1} + v_{2t} \tag{10.35}$$

We now estimate equations (10.33), (10.34'), and (10.35) together. 27 

27 See M. R. Wickens, “The Efficient Estimation of Econometric Models with Rational 
Expectations,” Review of Economic Studies, Vol. 49, 1982, pp. 55-67, who calls this method the 
“errors in variables method” (EVM). 
An alternative way of looking at this is to say that we estimate the demand 
and supply model given by (10.33) and (10.34) by replacing $p_t^*$ by $p_t$ and using 
lagged values of $z_{1t}$ and $z_{2t}$ as instrumental variables. 

There is an alternative method of estimating the model with rational 
expectations which is somewhat more complicated. This is called the substitution 
method (SM). In this method we derive an expression for $p_t^*$ using the 
relationship 

$$p_t^* = E(p_t \mid I_{t-1})$$

and substitute it in the model and then estimate the model. For this we proceed 
as follows. 

Equating demand and supply we get the equilibrium condition 

$$\beta_1 p_t + \gamma_1 z_{1t} + u_{1t} = \beta_2 p_t^* + \gamma_2 z_{2t} + u_{2t} \tag{10.36}$$

Now take expectations throughout, conditional on the information set $I_{t-1}$. 
Note that $E(u_{1t} \mid I_{t-1}) = 0$ and $E(u_{2t} \mid I_{t-1}) = 0$. Hence we get 

$$\beta_1 p_t^* + \gamma_1 z_{1t}^* = \beta_2 p_t^* + \gamma_2 z_{2t}^* \tag{10.37}$$

where 

$$z_{1t}^* = E(z_{1t} \mid I_{t-1}) \quad \text{and} \quad z_{2t}^* = E(z_{2t} \mid I_{t-1})$$

These are, respectively, the expectations of $z_{1t}$ and $z_{2t}$ as of time (t - 1). Let 
us define 

$$z_{1t} = z_{1t}^* + w_{1t}$$
$$z_{2t} = z_{2t}^* + w_{2t}$$

Subtracting (10.37) from (10.36), we get 

$$\beta_1 (p_t - p_t^*) + \gamma_1 w_{1t} + u_{1t} = \gamma_2 w_{2t} + u_{2t}$$

or 

$$p_t^* = p_t + \frac{1}{\beta_1} (\gamma_1 w_{1t} - \gamma_2 w_{2t} + u_{1t} - u_{2t}) \tag{10.38}$$

We now substitute this expression in (10.34) and get 

$$q_t = \beta_2 p_t + \gamma_2 z_{2t} + \frac{\beta_2 \gamma_1}{\beta_1} w_{1t} - \frac{\beta_2 \gamma_2}{\beta_1} w_{2t} + u_{2t}^* \tag{10.39}$$

where 

$$u_{2t}^* = u_{2t} + \frac{\beta_2}{\beta_1} (u_{1t} - u_{2t})$$

The error term $u_{2t}^*$ has the same properties as the error term $u_{2t}$ except in the 
special case where $u_{1t}$ and $u_{2t}$ are independent. Thus the consequence of the 
Muthian rational expectations hypothesis is the inclusion of the extra variables 
$w_{1t}$ and $w_{2t}$ in the supply function and the constraints on the parameters. (Note 
that the coefficients of $w_{1t}$ and $w_{2t}$ involve no new parameters.) This is an 
omitted variable interpretation of the rational expectations models. 
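
The algebra leading to (10.38) and (10.39) can be checked numerically. In the sketch below all parameter values, shocks, and forecast errors are arbitrary numbers chosen by us; the function simply solves the model for one period and confirms that the two derived expressions reproduce the expected price and the supply-side quantity.

```python
def check_substitution(beta1, beta2, gamma1, gamma2,
                       z1_exp, z2_exp, w1, w2, u1, u2):
    """Solve the demand-supply model for one period and verify (10.38) and (10.39)."""
    # Expected price from (10.37).
    p_exp = (gamma2 * z2_exp - gamma1 * z1_exp) / (beta1 - beta2)
    # Realized exogenous variables: expectation plus forecast error.
    z1, z2 = z1_exp + w1, z2_exp + w2
    # Market-clearing price from the equilibrium condition (10.36).
    p = (beta2 * p_exp + gamma2 * z2 + u2 - gamma1 * z1 - u1) / beta1
    # Expected price recovered from (10.38).
    p_exp_38 = p + (gamma1 * w1 - gamma2 * w2 + u1 - u2) / beta1
    # Quantity from the supply function (10.34) and from (10.39).
    q_supply = beta2 * p_exp + gamma2 * z2 + u2
    u2_star = u2 + (beta2 / beta1) * (u1 - u2)
    q_39 = (beta2 * p + gamma2 * z2 + (beta2 * gamma1 / beta1) * w1
            - (beta2 * gamma2 / beta1) * w2 + u2_star)
    return p_exp, p_exp_38, q_supply, q_39


p_exp, p_exp_38, q_supply, q_39 = check_substitution(
    beta1=-1.0, beta2=0.5, gamma1=1.0, gamma2=1.0,
    z1_exp=1.6, z2_exp=0.5, w1=0.2, w2=-0.1, u1=0.3, u2=-0.2)
print(p_exp, p_exp_38)   # the two expected prices agree
print(q_supply, q_39)    # the two quantities agree
```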

If the nature of the autoregressions for $z_{1t}$ and $z_{2t}$ is specified, one can derive 
an alternative expression for $p_t^*$. For simplicity of exposition, let us 
assume that $z_{1t}$ and $z_{2t}$ are first-order autoregressive as in equations (10.35). We 
then have 

$$z_{1t}^* = \alpha_1 z_{1,t-1}$$
$$z_{2t}^* = \alpha_2 z_{2,t-1}$$

Substituting these in (10.37), we get 

$$p_t^* = \frac{\gamma_2 \alpha_2 z_{2,t-1} - \gamma_1 \alpha_1 z_{1,t-1}}{\beta_1 - \beta_2}$$

Now we substitute this expression in (10.34) and get 

$$q_t = \frac{\beta_2 \gamma_2 \alpha_2}{\beta_1 - \beta_2} z_{2,t-1} - \frac{\beta_2 \gamma_1 \alpha_1}{\beta_1 - \beta_2} z_{1,t-1} + \gamma_2 z_{2t} + u_{2t} \tag{10.40}$$

We now estimate this equation along with equation (10.33) and equations 
(10.35). Again, note that the coefficients of $z_{1,t-1}$ and $z_{2,t-1}$ in equation (10.40) 
do not involve any new parameters. If higher-order autoregressions are used 
for $z_{1t}$ and $z_{2t}$, then $p_t^*$ will involve more of the lagged values of these 
variables. 
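
The cross-equation structure of (10.40) is easy to see in code: the coefficients on the lagged exogenous variables are built entirely from parameters that already appear in (10.33), (10.34), and (10.35). The sketch below uses arbitrary parameter values and hypothetical function names.

```python
def expected_price(beta1, beta2, gamma1, gamma2, alpha1, alpha2, z1_lag, z2_lag):
    """p*_t implied by (10.37) when z_1t and z_2t follow first-order autoregressions."""
    return (gamma2 * alpha2 * z2_lag - gamma1 * alpha1 * z1_lag) / (beta1 - beta2)


def lagged_z_coefficients(beta1, beta2, gamma1, gamma2, alpha1, alpha2):
    """Coefficients of z_{1,t-1} and z_{2,t-1} in the supply equation (10.40)."""
    c_z1_lag = -beta2 * gamma1 * alpha1 / (beta1 - beta2)
    c_z2_lag = beta2 * gamma2 * alpha2 / (beta1 - beta2)
    return c_z1_lag, c_z2_lag


params = dict(beta1=-1.0, beta2=0.5, gamma1=1.0, gamma2=1.0, alpha1=0.8, alpha2=0.5)
p_star = expected_price(z1_lag=2.0, z2_lag=1.0, **params)
c1, c2 = lagged_z_coefficients(**params)

# beta2 * p*_t is exactly the contribution of the lagged z's in (10.40).
print(params["beta2"] * p_star, c1 * 2.0 + c2 * 1.0)
```

Because no new parameter enters, estimating (10.40) freely and then comparing with the constrained estimates is what gives rise to the tests of cross-equation restrictions.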

Note that even if estimates of $\alpha_1$ and $\alpha_2$ are obtained from (10.35) and 
substituted in (10.40), we still have to deal with cross-equation constraints. Tests of 
these cross-equation constraints have often been referred to as “tests for the 
rational expectations hypothesis.” 28 The restrictions, however, arise because 
the exogenous variables $z_{1t}$ and $z_{2t}$ are not known at time (t - 1) and not from 
the rational expectations hypothesis as such. Moreover, the number of 
restrictions depends on the specification of the order of autoregression of the 
exogenous variables $z_{1t}$ and $z_{2t}$. In view of this, it might be inappropriate to name the 
tests for the restrictions as tests of the rational expectations hypothesis. Note 
also that if equation (10.39) is used, the restrictions do not depend on the 
specification of the order of autoregression of the exogenous variables. 

We have used a simple demand and supply model to outline the problems 
that are likely to arise in the Muthian rational expectations models if the 
exogenous variables are not known at time (t - 1). 

28 K. F. Wallis, “Econometric Implications of the Rational Expectations Hypothesis,” 
Econometrica, Vol. 48, 1980, pp. 49-74; N. S. Revankar, “Testing of the Rational Expectations 
Hypothesis,” Econometrica, Vol. 48, 1980, pp. 1347-1363; D. L. Hoffman and P. Schmidt, 
“Testing the Restrictions Implied by the Rational Expectations Hypothesis,” Journal of 
Econometrics, February 1981, pp. 265-288. 
Summary 

There are three procedures we have outlined. 

1. Just substitute $p_t$ for $p_t^*$ and use lagged exogenous variables as 
instruments. This can be done for all expectational variables. 

2. Estimate autoregressions for $z_{1t}$ and $z_{2t}$. Get the estimated residuals $\hat{w}_{1t}$ 
and $\hat{w}_{2t}$. Use these as additional regressors as in equation (10.39) after 
substituting $p_t$ for $p_t^*$. Estimate the equations using the parameter 
constraints (computer programs like TSP do this). 

3. Get an expression for $p_t^*$ based on the structure of the model and the 
structure of the exogenous variables as in equation (10.40). Estimate the 
equations using the parameter constraints. 

Note that method 1 uses equation (10.30), which says that $p_t = p_t^* + \varepsilon_t$. 
Methods 2 and 3 use equation (10.31), which says that 

$$p_t^* = E(p_t \mid I_{t-1})$$

The latter methods use the structure of the model in deriving an expression for 
$p_t^*$. Method 3 uses the structure of the exogenous variables as well. 


Illustrative Example 29 

We will illustrate the methods we have described with the estimation of a 
demand and supply model. In Table 10.5 we present data for the 1964-1984 
growing seasons on fresh strawberries in Florida. The variables in the table are 

$P_t$ = price (cents per flat) 

$Q_t$ = number of flats (thousands) 

$d_t$ = food price deflator (1972 = 1.00) 

$C_t$ = production cost index (1977 = 100) 

$N_t$ = U.S. population (millions) 

$X_t$ = real per capita food expenditures (dollars per year) 

The model estimated was the following: 

$$Q_t = \alpha_0 + \alpha_1 (P_t^* - C_t) + \alpha_2 [\ldots] + u_{1t} \qquad \text{supply}$$

$$P_t - d_t = \beta_0 + \beta_1 (Q_t - N_t) + \beta_2 X_t + \beta_3 t + u_{2t} \qquad \text{demand}$$

All variables are in logs except the time trend t. (Note that the data in Table 
10.5 are not in logs.) $P_t^*$ is the expected price (as expected at time t - 1). The 
different estimation methods depend on different specifications of $P_t^*$. 
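
Since the data in Table 10.5 are in levels, the variables have to be transformed before estimation. The sketch below does this for the demand-equation variables using the first three rows of the table; the dictionary keys and the origin of the time trend (t = 1 in 1964) are our own assumptions, not part of the original computations.

```python
import math

# First three rows of Table 10.5: (year, P, Q, d, C, N, X).
rows = [
    (1964, 378, 2134, 0.74, 78.97, 192, 651.04),
    (1965, 397, 1742, 0.76, 77.84, 194, 675.26),
    (1966, 395, 1467, 0.79, 81.16, 197, 685.28),
]


def demand_variables(row):
    """Log-transformed variables entering the demand equation."""
    year, P, Q, d, C, N, X = row
    return {
        "real_price": math.log(P) - math.log(d),    # P_t - d_t in logs
        "per_capita_q": math.log(Q) - math.log(N),  # Q_t - N_t in logs
        "log_x": math.log(X),                       # X_t in logs
        "trend": year - 1963,                       # assumed origin: t = 1 in 1964
    }


transformed = [demand_variables(r) for r in rows]
for v in transformed:
    print({k: round(val, 4) for k, val in v.items()})
```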

Five different models have been estimated (except for the first model, the 
other models assume rational expectations): 


29 I would like to thank my colleague, J. Scott Shonkwiler, for the data, model, and computations.
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Table 10.5 Data on Florida Fresh Strawberries, 1964-1984 


Year    P_t    Q_t    d_t      C_t    N_t     X_t
1964    378   2134   0.74    78.97    192   651.04
1965    397   1742   0.76    77.84    194   675.26
1966    395   1467   0.79    81.16    197   685.28
1967    346   1267   0.80    81.89    199   688.44
1968    391   1333   0.84    78.97    201   706.47
1969    553   1200   0.88    79.76    203   719.21
1970    419   1467   0.93    74.67    205   731.71
1971    379   1667   0.95    72.31    208   725.96
1972    516   1575   1.00    67.41    210   738.10
1973    457   1467   1.12    68.66    212   721.70
1974    508   1650   1.28    82.98    214   710.28
1975    506   1750   1.37   100.69    216   722.22
1976    493   1817   1.41    98.93    218   752.29
1977    689   2417   1.46   100.00    220   777.27
1978    692   3200   1.60    98.90    223   771.30
1979    706   3958   1.77    99.50    225   782.22
1980    498   5600   1.91   103.87    228   793.86
1981    644   8125   2.07   108.58    230   786.%
1982    614   8550   2.16   112.89    232   784.48
1983    538   7225   2.20   118.63    234   807.69
1984    694   8833   2.28   121.03    237   818.57


1. The Cobweb model. In this model we substitute $P_{t-1}$ for $P_t^*$ and
estimate the demand and supply functions by OLS.

2. The 2SLS (Two-Stage Least Squares) method. This is the procedure for the
rational expectations model if the exogenous variables are assumed to be known
at time $t - 1$ and the errors are not serially correlated. Since the rational
expectations hypothesis implies

$$P_t^* = P_t - v_t$$

where $v_t$ is an error uncorrelated with the variables in the information set
$I_{t-1}$, we just substitute $P_t - v_t$ for $P_t^*$ and combine the error
$v_t$ with the error in the supply function. Since $v_t$ has the same
properties as $u_{1t}$, we just estimate the model by two-stage least squares
(2SLS).

3. The IV (Instrumental Variable) method. This is the method we use to get
consistent estimators of the parameters under rational expectations when the
exogenous variables at time $t$ are not known at time $t - 1$. In this case,
the error $v_t$ can be correlated with the exogenous variables. We used the
lagged exogenous variables as instruments.

4. The OV (Omitted-Variable) method. In this method we first estimated
first-order autoregressions for the exogenous variables and used the residuals
as additional explanatory variables in the supply function [estimating an
equation like (10.39)], imposing no parameter constraints. After obtaining
initial estimates of the parameters in the demand and supply functions, the
predicted value of $P_t^*$ was obtained as in equation (10.38). This was
introduced in the supply function, which was then reestimated by OLS. This is
just a two-stage method of imposing the parameter constraints in (10.39).

5. The joint estimation method. In this model the explicit expression for the
rational expectation $P_t^*$ is derived and then equation (10.40) is estimated
jointly with the demand function, imposing parameter constraints.
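The first of these methods is simple enough to sketch in code. The block below is only an illustration, not the computation behind Tables 10.6 and 10.7: the function names (`solve_linear`, `ols`, `cobweb_supply`) are hypothetical, the OLS step is a bare normal-equations solver, and the $P_t$, $Q_t$, and $C_t$ columns are copied from Table 10.5. It builds the supply regressors in logs with $P_{t-1}$ substituted for $P_t^*$.

```python
import math


def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def cobweb_supply(P, Q, C):
    """Supply equation in logs with the naive expectation P*_t = P_{t-1}.

    Returns [alpha_0, alpha_1, alpha_2] for
    log Q_t = alpha_0 + alpha_1 (log P_{t-1} - log C_t) + alpha_2 log Q_{t-1}.
    """
    p, q, c = ([math.log(v) for v in series] for series in (P, Q, C))
    rows = [[1.0, p[t - 1] - c[t], q[t - 1]] for t in range(1, len(P))]
    y = [q[t] for t in range(1, len(P))]
    return ols(rows, y)


# P_t, Q_t and C_t columns of Table 10.5 (1964-1984), in levels.
P = [378, 397, 395, 346, 391, 553, 419, 379, 516, 457, 508,
     506, 493, 689, 692, 706, 498, 644, 614, 538, 694]
Q = [2134, 1742, 1467, 1267, 1333, 1200, 1467, 1667, 1575, 1467, 1650,
     1750, 1817, 2417, 3200, 3958, 5600, 8125, 8550, 7225, 8833]
C = [78.97, 77.84, 81.16, 81.89, 78.97, 79.76, 74.67, 72.31, 67.41, 68.66,
     82.98, 100.69, 98.93, 100.00, 98.90, 99.50, 103.87, 108.58, 112.89,
     118.63, 121.03]

print(cobweb_supply(P, Q, C))
```

Whether this reproduces the Cobweb column of Table 10.6 depends on details the text does not spell out (for example, the exact sample and any deflation of the price), so no output is quoted here. The other four methods differ only in how $P_t^*$ is replaced and in which instruments or constraints are used.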

The results from these five methods of estimation are presented in Tables
10.6 and 10.7. Regarding the estimates of the supply function, the price
coefficient is significant only when the estimation is done by the 2SLS method
or the joint estimation method. The IV and OV methods, which are less efficient
methods for the estimation of rational expectations models, gave rather worse
results (the results for the two methods, however, are very close to each other).


Table 10.6 Estimates of the Parameters of the Supply Function(a)

              Cobweb      2SLS        IV         OV       Joint Estimation
$\alpha_0$    -0.997     -1.593     -1.427     -1.371     -1.815
              (0.594)    (0.74J)    (0.762)    (0.726)    (0.747)
$\alpha_1$     0.310      0.604                            0.713
              (0.213)    (0.301)    (0.318)               (0.315)
$\alpha_2$     1.070      1.080                            1.085
              (0.057)    (0.058)    (0.057)    (0.075)    (0.054)
DW(b)          1.573      1.642      1.326                 --

(a) Figures in parentheses are standard errors.

(b) DW is the Durbin-Watson test statistic.


Table 10.7 Estimates of the Parameters of the Demand Function(a)

              Cobweb      2SLS        IV         OV       Joint Estimation
$\beta_0$     -9.796     -9.476    -14.896     -9.561     -9.531
              (9.848)    (9.871)   (11.643)    (9.860)    (9.720)
$\beta_1$     -0.121     -0.139     -0.136     -0.134     -0.133
              (0.075)    (0.079)    (0.078)    (0.077)    (0.077)
$\beta_2$      2.513      2.468      3.302      2.480      2.477
              (1.513)    (1.516)    (1.789)    (1.515)    (1.493)
$\beta_3$     -0.047     -0.045     -0.053     -0.045     -0.046
              (0.017)    (0.018)    (0.020)    (0.017)    (0.017)
DW(b)          2.286      2.315      2.306      2.245      --

(a) Figures in parentheses are standard errors.
(b) DW is the Durbin-Watson test statistic.
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The Cobweb model gave the worst results. Of some concern is the fact that the 
coefficient $\alpha_2$ of $Q_{t-1}$ is not significantly different from 1, and the DW statistic
is low for a model with a lagged dependent variable. Both these results indicate
that the model is misspecified. 

Regarding the estimates of the demand function (presented in Table 10.7), 
the Cobweb model and the 2SLS, OV, and joint estimation methods for the 
rational expectations model, all gave very similar results. It is only the IV 
method that gives slightly different results. The results are not very
encouraging, although $\beta_1$ and $\beta_2$ both have the expected signs. The
negative estimate of $\beta_3$ implies a gradual downward shift of the demand function.

Overall, we cannot say that the data are very informative about the models.
All we can say, perhaps, is that the rational expectations model appears to be
more appropriate than the Cobweb model. Many coefficients are not significant
(although they have signs that one would expect a priori). The computations,
however, illustrate the methods described. In estimating the rational
expectations models, we used progressively more information about the models in
the four methods described. Although the results are not much different in this
case, one would be able to find differences in other cases. The estimation of
the hyperinflation model described in Section 10.5 is left as an exercise.

10.13 The Serial Correlation Problem in 
Rational Expectations Models 


Until now we have assumed that the errors are serially uncorrelated. If they
are serially correlated, we cannot use variables in the information set
$I_{t-1}$ as instruments. The following example illustrates the problem.
Consider the model

$$y_t = \alpha x_t^* + \beta z_t + u_t \tag{10.41}$$

$z_t$ is an exogenous variable, but $x_t$ can be an exogenous or endogenous
variable. We will assume that $z_t$ is not a part of the information set
$I_{t-1}$. This means that $z_t$ should also be treated as endogenous if we
substitute $x_t$ for $x_t^*$. Finally, $u_t$ follows a first-order
autoregressive process

$$u_t = \rho u_{t-1} + v_t \qquad |\rho| < 1 \text{ and } v_t \sim IN(0, \sigma^2)$$

Let us write, as usual, equation (10.41) as

$$y_t = \rho y_{t-1} + \alpha x_t^* - \alpha\rho x_{t-1}^* + \beta z_t - \beta\rho z_{t-1} + v_t$$

Using the rational expectations assumption and substituting
$x_t^* = x_t - e_t$, we get

$$y_t = \rho y_{t-1} + \alpha x_t - \alpha\rho x_{t-1} + \beta z_t - \beta\rho z_{t-1} + w_t \tag{10.42}$$

where $w_t = v_t - \alpha(e_t - \rho e_{t-1})$.

We would like to estimate this equation by instrumental variables. Note that
$x_{t-1}$ is correlated with $e_{t-1}$ and hence with $w_t$. Thus although
$y_{t-1}$, $x_{t-1}$, and $z_{t-1}$ are definitely in the information set
$I_{t-1}$, we cannot use them as instrumental variables.

The solution to the problem is to use the set of variables in $I_{t-2}$ rather
than $I_{t-1}$. Thus we obtain the predicted values $\hat{x}_t$, $\hat{z}_t$,
$\hat{z}_{t-1}$, $\hat{y}_{t-1}$ by regressing each of these variables on the
variables in the information set $I_{t-2}$ and use these as instruments for
estimating equation (10.42). Note that if the autoregression in $u_t$ is of a
higher order, we have to use the variables in an information set with a longer
lag. For instance, if $u_t$ is a second-order autoregression, we use $I_{t-3}$,
and so on.

Thus with serial correlation in errors we should not use the variables in the
information set $I_{t-1}$ to construct the instrumental variables in models
with rational expectations. 30
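The reason the lagged variables fail as instruments can be written out in one line. The derivation below is a sketch under assumptions that go slightly beyond what the text states: $\sigma_e^2$ denotes the variance of the expectation error (a symbol introduced here), the expectation error is taken to be uncorrelated with everything dated before it (as rational expectations implies), and the innovation $v_t$ is taken to be uncorrelated with all variables dated $t - 1$ or earlier.

```latex
\begin{align*}
\operatorname{cov}(x_{t-1}, w_t)
  &= \operatorname{cov}\bigl(x_{t-1}^* + e_{t-1},\;
       v_t - \alpha e_t + \alpha\rho\, e_{t-1}\bigr) \\
  &= \alpha\rho \operatorname{var}(e_{t-1})
   = \alpha\rho\,\sigma_e^2 \neq 0
     \quad \text{unless } \alpha\rho = 0, \\
\operatorname{cov}(s_{t-2}, w_t) &= 0
     \quad \text{for any variable } s_{t-2} \text{ in } I_{t-2}.
\end{align*}
```

The second line holds because $v_t$, $e_t$, and $e_{t-1}$ are all uncorrelated with information dated $t - 2$ or earlier, which is why moving the instrument set back one period restores consistency in the first-order case.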

Summary 


1. In this chapter we have discussed three types of models of expectations: 

(a) Naive models of expectations. 

(b) The adaptive expectations model. 

(c) The rational expectations model. 

Naive models merely serve as benchmarks against which other models or survey
data on expectations are judged.

2. Two methods of estimation of the adaptive expectations model have been 
discussed: 

(a) Estimation in the autoregressive form. 

(b) Estimation in the distributed lag form. 

The first method is easy to use because one can use an OLS regression program.
However, this method is not advisable. The second method is the better
one. It can be implemented with an OLS regression program by using a search
method (see Section 10.4).

3. Many economic variables when disturbed from their equilibrium position
do not adjust instantly to their new equilibrium position. There are some lags
in adjustment. The partial adjustment model has been suggested to handle this
problem. Recently, a generalization of this model, the error correction model,
has been found to be more useful (see Section 10.6). One can combine these
partial adjustment models with the adaptive (or any other) model of
expectations (see Section 10.7).

4. Besides the partial adjustment and error correction models, two other 
models are commonly used in the literature on adjustment lags. These are: 

(a) The polynomial lag (also known as the Almon lag) (Section 10.8). 

(b) The rational lags (Section 10.9). 

30 A concise discussion of methods of obtaining consistent estimates of the parameters in
single-equation models with rational expectations can be found in B. T. McCallum, "Topics
Concerning the Formulation, Estimation and Use of Macroeconomic Models with Rational
Expectations," Proceedings of the Business and Economic Statistics Section, American
Statistical Association, 1979, pp. 65-72.
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The latter should not be confused with the rational expectations model. Ra¬ 
tional here means the ratio of two polynomials. 

5. In the polynomial lags one has to choose the length of the lag and the 
degree of the polynomial. Given the lag length, some procedures have been 
suggested for these two choices. All lag distributions imply some restrictions 
on the coefficients. Before imposing these restrictions, it is always best to es¬ 
timate unconstrained lag models by OLS. 

6. The rational expectations model suggested by Muth has been very popular.
The idea behind the rational expectations hypothesis is that the specification
of expectations should be consistent with the rest of the model rather
than being ad hoc. A proper terminology for this is "model-consistent
expectations" rather than "rational expectations." Two investigators starting
with two different models can arrive at two different expressions of the
"rational expectation" of the same variable. However, we continued to use the
word "rational" rather than "model consistent" because the former is the
commonly used term.

7. The essence of the rational expectations hypothesis is that the difference 
between the realized value and the expected value should be uncorrelated with 
all the variables in the information set at the time the expectation is formed. 
Based on this, several tests for “rationality” have been applied to survey data 
on expectations. These tests are described in Section 10.11. In a large number 
of cases these tests reject the “rationality” of survey data on expectations. 

8. There are, broadly speaking, two methods of estimation for rational
expectations models. One procedure involves substitution of the realized value
for the expected value and using some appropriate instrumental variables. The 
other procedure involves obtaining an explicit expression for the expected 
value from the model, substituting this in the model and then estimating the 
model using any parameter constraints that are implied. These procedures are 
illustrated with reference to a demand and supply model in Section 10.12. 

9. With serially correlated errors one has to be careful in the choice of
appropriate instruments when estimating rational expectations models. A careful
examination of which variables are correlated or uncorrelated with errors
will reveal what variables are valid instruments. This is illustrated with a
first-order autoregression in Section 10.13.

10. Some empirical examples are provided for the Almon lag, but computations
of the Koyck model and some rational expectations models are left as
exercises.


Exercises 


1. Explain the meaning of each of the following terms. 

(a) Adaptive expectations. 

(b) Regressive expectations. 

(c) Rational expectations. 
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(d) Koyck lag. 

(e) Almon lag. 

(f) Rational lags. 

(g) Partial adjustment model. 

(h) Error correction model. 

(i) Testing rationality. 

(j) Estimation in autoregressive form. 

(k) Estimation in distributed lag form. 

2. What are the problems one encounters in the OLS estimation under adaptive 
expectations in the following models? 

(a) Models of agricultural supply. 

(b) Models of hyperinflation. 

(c) Partial adjustment models. 

(d) Error correction models. 

3. Answer Exercise 2 if, instead of assuming adaptive expectations, we assume 
the following. 

(a) Naive expectations $y_t^* = y_{t-1}$.

(b) Rational expectations. 

4. In the demand and supply model for pork discussed in Section 9.3 (data are
in Table 9.1), assume that supply depends on expected price, $P_t^*$. Estimate
the model assuming the following.

(a) Naive expectations $P_t^* = P_{t-1}$. This is the Cobweb model.

(b) Adaptive expectations. 

(c) Rational expectations. 

5. Estimate the hyperinflation model for Hungary and Germany using the data 
in Tables 10.1 and 10.2 and assuming the following. 

(a) Naive expectations $P_{t+1}^* = P_t$.

(b) Adaptive expectations. 

(c) Rational expectations. 

6. Answer Exercise 4 with the Australian wine data discussed in Section 9.5 
(data are in Table 9.2). Assume that the supply depends on expected price. 

7. Explain tests for rationality based on 

(a) $y_t - y_t^*$.

(b) var $y_t^*$ and var $y_t$.

(c) Tests for equality of coefficients. 

8. What is meant by weak tests for rationality? Are these tests really weak? 

9. Future prices have often been used as a proxy for expected prices in the 
cases of estimation of supply functions for agricultural commodities. How 
do you test whether the expectations implied in the future prices are ra¬ 
tional? 
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Errors in Variables 


11.1 Introduction 

11.2 The Classical Solution for a Single-Equation Model with One 
Explanatory Variable 

11.3 The Single-Equation Model with Two Explanatory Variables 

11.4 Reverse Regression 

11.5 Instrumental Variable Methods 

11.6 Proxy Variables

11.7 Some Other Problems

Summary

Exercises


11.1 Introduction 

Since the early 1970s there has been a resurgence of interest in the topic of
errors in variables models and models involving latent variables. 1 This late
interest is perhaps surprising since there is no doubt that almost all economic
variables are measured with error.

1 Three early papers that sparked interest in this area are: A. Zellner, "Estimation of Regression
Relationships Containing Unobservable Independent Variables," International Economic Review,
October 1970, pp. 441-454; A. S. Goldberger, "Maximum Likelihood Estimation of
Regression Models Containing Unobservable Variables," International Economic Review,
January 1972, pp. 1-15; and Zvi Griliches, "Errors in Variables and Other Unobservables,"
Econometrica, November 1974, pp. 971-998.
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Part of the neglect of errors in variables is due to the fact that consistent 
estimation of parameters is rather difficult. During the 1970s this attitude 
changed because of the following developments: 

1. It was realized that sometimes we can use some extra information available
in the dynamics of the equation or the structure of other equations
in the model to get consistent estimates of the parameters.

2. The fact that we cannot get consistent point estimates of the parameters 
does not imply that no inference is possible. It was realized that one can 
obtain expressions for the direction of the biases and one can also obtain 
consistent “bounds” for the parameters. 

We will now discuss how consistent estimation of equations which contain
variables that are measured with errors can be accomplished. We start our
discussion with a single-equation model with one explanatory variable measured
with error. This is known as the "classical" errors-in-variables model. We
then consider a single equation with two explanatory variables. Next we apply
these results to the problem of "reverse regression," which has been a
suggested technique in the analysis of wage discrimination by race and sex.

Before we proceed, we should make it clear what we mean by “errors in 
variables.” Broadly speaking, there are two types of errors that we can talk 
about. 

1. Recording errors. 

2. Errors due to using an imperfect measure of the true variable. Such
imperfect measures are known as "proxy" variables. The true variables are
often not measurable and are called "latent" variables.

What we will be concerned with in this chapter are errors of type 2. 

A commonly cited example of a proxy variable is years of schooling (S), 
which is a proxy for education (E). In this case, however, the error S - E is 
likely to be independent of S rather than of E. Another example is that of 
expectational variables, which must be replaced by proxies (either estimates 
generated from the past or measures obtained from survey data). Examples of 
this were given in Chapter 10. 

We start our discussion with the classical errors-in-variables model and 
then progressively relax at least some of the restrictive assumptions of this 
model. 


11.2 The Classical Solution for a 
Single-Equation Model with 
One Explanatory Variable 

Suppose that the true model is

$$y = \beta x + e \tag{11.1}$$

Instead of $y$ and $x$, we measure

$$Y = y + v \quad \text{and} \quad X = x + u$$

$u$ and $v$ are measurement errors; $x$ and $y$ are called the systematic
components. We will assume that the errors have zero means and variances
$\sigma_u^2$ and $\sigma_v^2$, respectively. We will also assume that they are
mutually uncorrelated and are uncorrelated with the systematic components.
That is,

$$E(u) = E(v) = 0 \qquad \text{var}(u) = \sigma_u^2 \qquad \text{var}(v) = \sigma_v^2$$

$$\text{cov}(u, x) = \text{cov}(u, y) = \text{cov}(v, x) = \text{cov}(v, y) = 0$$


Equation (11.1) can be written in terms of the observed variables as

$$Y - v = \beta(X - u) + e$$

or

$$Y = \beta X + w \tag{11.2}$$

where $w = e + v - \beta u$. The reason we cannot apply the OLS method to
equation (11.2) is that $\text{cov}(w, X) \neq 0$. In fact,

$$\text{cov}(w, X) = \text{cov}(-\beta u, x + u) = -\beta\sigma_u^2$$

Thus one of the basic assumptions of least squares is violated. If only $y$ is
measured with error and $x$ is measured without error, there is no problem
because $\text{cov}(w, X) = 0$ in this case. Thus given the specification
(11.1) of the true relationship, it is errors in $x$ that cause a problem.

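If we estimate $\beta$ by OLS applied to (11.2) we have

$$b_{YX} = \frac{\sum XY}{\sum X^2} = \frac{\sum (x + u)(y + v)}{\sum (x + u)^2}$$

$$\text{plim } b_{YX} = \frac{\text{cov}(x, y)}{\text{var}(x) + \text{var}(u)} = \frac{\sigma_{xy}}{\sigma_x^2 + \sigma_u^2}$$

since all cross products vanish. Since $\beta = \sigma_{xy}/\sigma_x^2$, we have

$$\text{plim } b_{YX} = \frac{\beta}{1 + \sigma_u^2/\sigma_x^2} \tag{11.3}$$

Thus $b_{YX}$ will underestimate $\beta$. The degree of underestimation depends
on $\sigma_u^2/\sigma_x^2$.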

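If we run a reverse regression (i.e., regress $X$ on $Y$), we have

$$b_{XY} = \frac{\sum XY}{\sum Y^2} = \frac{\sum (x + u)(y + v)}{\sum (y + v)^2}$$

Hence

$$\text{plim } b_{XY} = \frac{\sigma_{xy}}{\sigma_y^2 + \sigma_v^2}$$

But from (11.1)

$$\sigma_y^2 = \beta^2\sigma_x^2 + \sigma_e^2 \quad \text{and} \quad \sigma_{xy} = \beta\sigma_x^2$$

Hence

$$\text{plim } \frac{1}{b_{XY}} = \frac{\beta^2\sigma_x^2 + \sigma_e^2 + \sigma_v^2}{\beta\sigma_x^2} = \beta\left(1 + \frac{\sigma_e^2 + \sigma_v^2}{\beta^2\sigma_x^2}\right)$$

Thus $1/b_{XY}$ overestimates $\beta$, and we have

$$\text{plim } b_{YX} < \beta < \frac{1}{\text{plim } b_{XY}} \tag{11.3'}$$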


We can use the two regression coefficients $b_{YX}$ and $b_{XY}$ to get bounds
on $\beta$ (at least in large samples).

In the preceding discussion we have implicitly assumed that $\beta > 0$. If
$\beta < 0$, then

$$\text{plim } b_{YX} > \beta \quad \text{and} \quad (\text{plim } b_{XY})^{-1} < \beta$$

and thus the bounds have to be reversed, that is,

$$(\text{plim } b_{XY})^{-1} < \beta < \text{plim } b_{YX}$$


Note that since $b_{XY} \cdot b_{YX} = R^2_{XY}$, the higher the value of $R^2$
the closer these bounds are. Consider, for instance, the data in Table 7.3. 2
We have the equations

$$Y = 0.193 X_1 - 15.86 \qquad R^2 = 0.969$$

$$X_1 = 5.088 Y + 86.87 \qquad R^2 = 0.969$$

The second equation when solved for $Y$ gives

$$Y = 0.199 X_1 - 17.32$$

Hence we have the bounds $0.193 < \beta_1 < 0.199$. Such consistent estimates of
the bounds have been studied by Frisch and Schultz. 3
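Because $b_{XY} \cdot b_{YX} = R^2$, the upper bound $1/b_{XY}$ can be computed directly as $b_{YX}/R^2$ without running the reverse regression. The short sketch below checks this against the figures quoted above; the function name is hypothetical and the case $\beta > 0$ is assumed.

```python
def bounds_from_direct_regression(b_yx, r_squared):
    """Large-sample bounds (lower, upper) on beta, assuming beta > 0.

    Uses b_XY * b_YX = R^2, so the reverse-regression bound 1 / b_XY
    equals b_YX / R^2.
    """
    if b_yx <= 0 or not 0 < r_squared <= 1:
        raise ValueError("requires b_yx > 0 and 0 < R^2 <= 1")
    return b_yx, b_yx / r_squared


lower, upper = bounds_from_direct_regression(0.193, 0.969)
print(round(lower, 3), round(upper, 3))  # 0.193 0.199
```

The width of the interval is $b_{YX}(1/R^2 - 1)$, which makes explicit why a high $R^2$ gives tight bounds.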

The general conclusion from (11.3) is that the least squares estimator of
$\beta$ is biased toward zero and if (11.1) has a constant term, $\alpha$, the
least squares estimator of $\alpha$ is biased away from zero. 4 If we define

$$\lambda = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_x^2}$$

then from (11.3) we get the result that the asymptotic bias in the least
squares estimator of $\beta$ is $-\beta\lambda$.

2 These data are from E. Malinvaud, Statistical Methods of Econometrics, 3rd ed. (Amsterdam:
North-Holland, 1980), p. 19.

3 R. Frisch, Statistical Confluence Analysis by Means of Complete Regression Systems,
Publication 5 (Oslo: University Economics Institute, 1934); H. Schultz, The Theory and
Measurement of Demand (Chicago: University of Chicago Press, 1938).

4 In the discussion that follows we mean by bias, asymptotic bias or more precisely
plim $\hat{\theta} - \theta$, where $\hat{\theta}$ is the estimator of $\theta$.
11.3 The Single-Equation Model with
Two Explanatory Variables

In Section 11.2 we derived the bias in the least squares estimator and also some 
consistent bounds for the true parameter. We will now see how these two re¬ 
sults can be extended to the case of several explanatory variables. Since it is 
instructive to consider some simple models, we discuss a model with two ex¬ 
planatory variables in some detail: 

Two Explanatory Variables: One Measured with Error 

Let us consider the case of two explanatory variables only one of which is
measured with error. The equation is

$$y = \beta_1 x_1 + \beta_2 x_2 + e$$

The observed variables are

$$Y = y + v \qquad X_1 = x_1 + u \qquad X_2 = x_2$$

As before we will assume that the errors $u$, $v$, $e$ are mutually
uncorrelated and also uncorrelated with $y$, $x_1$, $x_2$. Note that we can
combine the errors $e$ and $v$ into a single composite error in $Y$. Let
$\text{var}(u) = \sigma_u^2$ and $\text{var}(v + e) = \sigma^2$. The regression
equation we can estimate in terms of observables is

$$Y = \beta_1 X_1 + \beta_2 X_2 + w$$

where

$$w = e + v - \beta_1 u$$

To save on notation (and without any loss in generality) we will normalize
the observed variables $X_1$ and $X_2$ so that $\text{var}(X_1) = 1$,
$\text{var}(X_2) = 1$, and $\text{cov}(X_1, X_2) = \rho$. Note that

$$\text{cov}(X_1, w) = -\beta_1\sigma_u^2 = -\beta_1\lambda$$

where

$$\lambda = \frac{\sigma_u^2}{\text{var}(X_1)} = \sigma_u^2 \quad \text{since } \text{var}(X_1) = 1$$

Hence we get

$$\text{var}(Y) = \beta_1^2 + \beta_2^2 + 2\rho\beta_1\beta_2 + \sigma^2 - \beta_1^2\lambda$$

$$\text{cov}(X_1, Y) = \beta_1 + \beta_2\rho - \beta_1\lambda$$

$$\text{cov}(X_2, Y) = \beta_1\rho + \beta_2$$

It is important to note that in this model there are some limits on $\lambda$
that need be imposed for the analysis to make sense. This is because the
condition

$$\text{var}(x_1)\,\text{var}(x_2) - [\text{cov}(x_1, x_2)]^2 \geq 0$$

needs to be satisfied.

Since $\text{var}(X_1) = 1$ we have $\text{var}(x_1) = 1 - \lambda$. Also,
$\text{var}(X_2) = \text{var}(x_2) = 1$ and
$\text{cov}(x_1, x_2) = \text{cov}(X_1, X_2) = \rho$. Hence we have the
condition

$$1 - \lambda - \rho^2 \geq 0 \quad \text{or} \quad \lambda \leq 1 - \rho^2$$

This condition will turn out to be very stringent if $\rho^2$ is large or there
is high collinearity between $X_1$ and $X_2$. For instance, if
$\rho^2 = 0.99$, we cannot assume that $\lambda > 0.01$. If $\lambda = 0.01$,
this implies that $x_1$ and $x_2$ are perfectly correlated.

What this implies is that we cannot use the classical errors in variables model
if $X_1$ and $X_2$ are highly correlated and we believe that the variance of
the error in $x_1$ is high.

We will see the implications of this condition when we derive the probability
limits of the least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_2$. The
least squares estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ of $\beta_1$ and
$\beta_2$ are obtained by solving the equations

$$\sum X_1^2\, \hat{\beta}_1 + \sum X_1 X_2\, \hat{\beta}_2 = \sum X_1 Y$$

$$\sum X_1 X_2\, \hat{\beta}_1 + \sum X_2^2\, \hat{\beta}_2 = \sum X_2 Y$$

Dividing by the sample size throughout and taking probability limits, we get

$$\text{var}(X_1)\,\text{plim } \hat{\beta}_1 + \text{cov}(X_1, X_2)\,\text{plim } \hat{\beta}_2 = \text{cov}(X_1, Y)$$

$$\text{cov}(X_1, X_2)\,\text{plim } \hat{\beta}_1 + \text{var}(X_2)\,\text{plim } \hat{\beta}_2 = \text{cov}(X_2, Y)$$

That is,

$$\text{plim } \hat{\beta}_1 + \rho\,\text{plim } \hat{\beta}_2 = \beta_1 + \beta_2\rho - \beta_1\lambda$$

$$\rho\,\text{plim } \hat{\beta}_1 + \text{plim } \hat{\beta}_2 = \beta_1\rho + \beta_2$$

Solving these equations, we get 5

$$\text{plim } \hat{\beta}_1 = \beta_1 - \frac{\beta_1\lambda}{1 - \rho^2}$$
$$\text{plim } \hat{\beta}_2 = \beta_2 + \frac{\beta_1\lambda\rho}{1 - \rho^2} \tag{11.4}$$

Thus bias in $\hat{\beta}_2 = -\rho$(bias in $\hat{\beta}_1$). This result also
applies if there is more than one explanatory variable with no errors. For
instance, if the regression equation is

$$y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + e$$

and only $x_1$ is measured with error, the first formula in (11.4) remains the
same except that $\rho^2$ is now the square of the multiple correlation
coefficient between $X_1$ and $(X_2, X_3, \ldots, X_k)$. As for the biases in
the other coefficients, we have

$$\text{bias in } \hat{\beta}_j = -\gamma_j\,(\text{bias in } \hat{\beta}_1)$$

where $\gamma_j$ are the regression coefficients from the "auxiliary"
regression of $X_1$ on $X_2, X_3, \ldots, X_k$:

$$E(X_1 \mid X_2, X_3, \ldots, X_k) = \gamma_2 X_2 + \gamma_3 X_3 + \cdots + \gamma_k X_k$$

Note that we are normalizing all the observed variables to have a unit
variance. 6

5 These results are derived in Z. Griliches and V. Ringstad, Economies of Scale and the Form
of the Production Function (Amsterdam: North-Holland, 1971), App. C.
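The formulas in (11.4) are easy to check numerically. The following sketch (the function name and the parameter values are illustrative assumptions, not taken from the text) evaluates the two probability limits for given $\beta_1$, $\beta_2$, $\rho$, and $\lambda$ and confirms that they satisfy the two moment equations from which they were derived.

```python
def ols_plims(beta1, beta2, rho, lam):
    """Probability limits of the OLS estimators from (11.4).

    Variables are normalized so that var(X1) = var(X2) = 1 and
    cov(X1, X2) = rho; lam is the error variance share in X1.
    """
    if not 0 <= lam <= 1 - rho**2:
        raise ValueError("need 0 <= lambda <= 1 - rho^2")
    bias1 = -beta1 * lam / (1 - rho**2)
    bias2 = beta1 * lam * rho / (1 - rho**2)
    return beta1 + bias1, beta2 + bias2


beta1, beta2, rho, lam = 1.0, 0.5, 0.6, 0.2
p1, p2 = ols_plims(beta1, beta2, rho, lam)
print(round(p1, 4), round(p2, 4))  # 0.6875 0.6875

# The limits should solve the two moment equations preceding (11.4).
lhs1, rhs1 = p1 + rho * p2, beta1 + beta2 * rho - beta1 * lam
lhs2, rhs2 = rho * p1 + p2, beta1 * rho + beta2
print(abs(lhs1 - rhs1) < 1e-12, abs(lhs2 - rhs2) < 1e-12)  # True True
```

In this example the coefficient of the mismeasured variable is pulled from 1.0 toward zero, while the coefficient of the correctly measured variable is pushed up, in line with bias in $\hat{\beta}_2 = -\rho$(bias in $\hat{\beta}_1$).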

Returning to equations (11.4), note that if $\rho = 0$, the bias in
$\hat{\beta}_1$ is $-\beta_1\lambda$ as derived in (11.3). Whether or not
$\hat{\beta}_1$ is biased toward zero as before depends on whether or not
$\lambda < (1 - \rho^2)$. As we argued earlier, this condition has to be
imposed for the classical errors in variables model to make sense. Thus, even
in this model we can assume that $\hat{\beta}_1$ is biased toward zero. As for
$\hat{\beta}_2$, the direction of bias depends on the sign of $\beta_1\rho$.
The sign of $\rho$ is known from the data and the sign of $\beta_1$ is the same
as the sign of $\hat{\beta}_1$.

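Consider now the regression of $X_1$ on $Y$ and $X_2$. Let the equation be

$$X_1 = \gamma_1 Y + \gamma_2 X_2 + w^*$$

Then

$$\gamma_1 = \frac{1}{\beta_1} \qquad \gamma_2 = -\frac{\beta_2}{\beta_1} \qquad w^* = -\frac{w}{\beta_1}$$

The least squares estimators $\hat{\gamma}_1$ and $\hat{\gamma}_2$ for
$\gamma_1$ and $\gamma_2$ are obtained from the equations

$$\left(\sum Y^2\right)\hat{\gamma}_1 + \left(\sum Y X_2\right)\hat{\gamma}_2 = \sum X_1 Y$$

$$\left(\sum Y X_2\right)\hat{\gamma}_1 + \left(\sum X_2^2\right)\hat{\gamma}_2 = \sum X_1 X_2$$

Divide throughout by the sample size and take probability limits. That is, we
substitute population variances for sample variances and covariances. We get

$$(\beta_1^2 + \beta_2^2 + 2\rho\beta_1\beta_2 + \sigma^2 - \beta_1^2\lambda)\,\text{plim } \hat{\gamma}_1 + (\beta_1\rho + \beta_2)\,\text{plim } \hat{\gamma}_2 = \beta_1 + \beta_2\rho - \beta_1\lambda$$

$$(\beta_1\rho + \beta_2)\,\text{plim } \hat{\gamma}_1 + \text{plim } \hat{\gamma}_2 = \rho$$

Hence

$$\text{plim } \hat{\gamma}_1 = \frac{1}{\Delta}\left[\beta_1(1 - \rho^2 - \lambda)\right]$$

$$\text{plim } \hat{\gamma}_2 = \frac{1}{\Delta}\left[-\beta_1\beta_2(1 - \rho^2 - \lambda) + \rho\sigma^2\right]$$

where

$$\Delta = \beta_1^2(1 - \rho^2 - \lambda) + \sigma^2$$

Hence the estimates $(\tilde{\beta}_1, \tilde{\beta}_2)$ of
$(\beta_1, \beta_2)$ from the equations

$$\tilde{\beta}_1 = \frac{1}{\hat{\gamma}_1} \qquad \tilde{\beta}_2 = -\frac{\hat{\gamma}_2}{\hat{\gamma}_1}$$

have the probability limits

$$\text{plim } \tilde{\beta}_1 = \beta_1 + \frac{\sigma^2}{\beta_1(1 - \rho^2 - \lambda)} = \beta_1\left[1 + \frac{\sigma^2}{\beta_1^2(1 - \rho^2 - \lambda)}\right]$$
$$\text{plim } \tilde{\beta}_2 = \beta_2 - \frac{\rho\sigma^2}{\beta_1(1 - \rho^2 - \lambda)} \tag{11.5}$$

Again the bias in $\tilde{\beta}_2 = -\rho$(bias in $\tilde{\beta}_1$). Also,
if $\lambda < (1 - \rho^2)$ or $(1 - \rho^2 - \lambda) > 0$, we have the result
that $\tilde{\beta}_1$ is biased away from zero. Hence

$$\text{plim } \hat{\beta}_1 < \beta_1 < \text{plim } \tilde{\beta}_1 \quad \text{if } \beta_1 > 0$$

and (11.6)

$$\text{plim } \hat{\beta}_1 > \beta_1 > \text{plim } \tilde{\beta}_1 \quad \text{if } \beta_1 < 0$$

As for $\beta_2$ the bias will depend on the sign of $\rho\beta_1$.

If $\rho\beta_1 > 0$, we have

$$\text{plim } \tilde{\beta}_2 < \beta_2 < \text{plim } \hat{\beta}_2 \tag{11.7a}$$

and if $\rho\beta_1 < 0$, we have

$$\text{plim } \tilde{\beta}_2 > \beta_2 > \text{plim } \hat{\beta}_2 \tag{11.7b}$$

6 Similar formulas have been derived in M. D. Levi, "Errors in Variables Bias in the Presence
of Correctly Measured Variables," Econometrica, Vol. 41, 1973, pp. 985-986, and Steven
Garber and Steven Klepper, "Extending the Classical Normal Errors in Variables Model,"
Econometrica, Vol. 48, No. 6, 1980, pp. 1541-1546. However, these papers discuss the biases in
terms of the variances and covariances of the true variables $x_1, x_2, \ldots, x_k$ as well
as the auxiliary regression of the true variable $x_1$ on $x_2, \ldots, x_k$. The formulas
stated here are in terms of the correlations of the observed variables. The only unknown
parameter is thus $\lambda$. The formulas presented here are practically more useful since
they are in terms of the correlations of the observed variables.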
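The closed forms in (11.5) and the orderings in (11.6) and (11.7a) can be verified numerically. The sketch below uses hypothetical function names and illustrative parameter values (none of them from the text): it solves the two probability-limit equations of the reverse regression directly, converts the result to $(\tilde{\beta}_1, \tilde{\beta}_2)$, and compares with the closed forms and with the direct-regression limits from (11.4).

```python
def direct_plims(beta1, beta2, rho, lam):
    """plim of the direct OLS estimators, equations (11.4)."""
    return (beta1 - beta1 * lam / (1 - rho**2),
            beta2 + beta1 * lam * rho / (1 - rho**2))


def reverse_plims_closed_form(beta1, beta2, rho, lam, sigma2):
    """plim of the reverse-regression estimates, equations (11.5)."""
    d = 1 - rho**2 - lam
    if d <= 0:
        raise ValueError("need lambda < 1 - rho^2")
    return (beta1 + sigma2 / (beta1 * d),
            beta2 - rho * sigma2 / (beta1 * d))


def reverse_plims_from_moments(beta1, beta2, rho, lam, sigma2):
    """Solve the two plim equations of the regression of X1 on Y and X2."""
    var_y = beta1**2 + beta2**2 + 2 * rho * beta1 * beta2 + sigma2 - beta1**2 * lam
    cov_y_x2 = beta1 * rho + beta2
    cov_x1_y = beta1 + beta2 * rho - beta1 * lam
    # Second equation gives g2 = rho - cov_y_x2 * g1; substitute into the first.
    g1 = (cov_x1_y - cov_y_x2 * rho) / (var_y - cov_y_x2**2)
    g2 = rho - cov_y_x2 * g1
    return 1.0 / g1, -g2 / g1


params = dict(beta1=1.0, beta2=0.5, rho=0.6, lam=0.2, sigma2=1.0)
hat1, hat2 = direct_plims(1.0, 0.5, 0.6, 0.2)
til1, til2 = reverse_plims_closed_form(**params)
num1, num2 = reverse_plims_from_moments(**params)

print(abs(til1 - num1) < 1e-9, abs(til2 - num2) < 1e-9)  # True True
print(hat1 < params["beta1"] < til1)                     # True, as in (11.6)
print(til2 < params["beta2"] < hat2)                     # True, as in (11.7a)
```

Since $\rho\beta_1 > 0$ in this example, the bounds on $\beta_2$ take the form (11.7a); changing the sign of `rho` reverses them as in (11.7b).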


Illustrative Example 

Consider the data in Table 7.3 and a regression of imports ($Y$) on domestic
production ($X_1$) and consumption ($X_3$). Note that $X_1$ and $X_3$ are very
highly correlated, $\rho^2 = 0.99789$ or $1 - \rho^2 = 0.00211$. Let us assume
that $X_3$ has no measurement error but that there is measurement error in
$X_1$. As mentioned earlier, we have to have $\lambda < 1 - \rho^2$ or
$\lambda < 0.00211$. This implies that the variance of the measurement error in
$X_1$ cannot be greater than 0.211% of the variance of $X_1$. Thus we have to
make a very stringent assumption on $\lambda$ to do any sensible analysis of
the data.

The regression equation in this case is

$$Y = 0.043 X_1 + 0.230 X_3 - 18.63 \qquad R^2 = 0.970$$
          (0.1907)     (0.2913)

(Figures in parentheses are standard errors.) The reverse regression is

$$X_1 = 0.078 Y + 1.503 X_3 - 16.37 \qquad R^2 = 0.998$$

which, when normalized with respect to $Y$, gives

$$Y = 12.81 X_1 - 19.28 X_3 + 2.09$$

If $\lambda < 0.00211$, we come to the conclusion that $\beta_1 > 0$, and hence
from (11.6) the consistent estimates of bounds are

$$0.043 < \beta_1 < 12.81$$

Also, from (11.7) the consistent bounds for $\beta_2$ are (since
$\rho\beta_1 > 0$)

$$-19.28 < \beta_2 < 0.23$$

All this merely indicates that the coefficients $\beta_1$ and $\beta_2$ are not
estimable with any precision, which is what one would also learn from the high
multicollinearity between $X_1$ and $X_3$ and the large standard errors
reported. What the errors in variables analysis shows is that even very small
errors in measurement can make a great deal of difference to the results, that
is, the multicollinearity may be even more serious than what it appears to be
and that the confidence intervals are perhaps wider than those reported.

Note that the consistent bounds are not comparable to a confidence interval.
The estimated bounds themselves have standard errors since $\hat{\beta}_1$ and
$\tilde{\beta}_1$ are both subject to sampling variation. If there are no
errors in variables, even with multicollinearity, the least squares estimators
are unbiased. But with errors in measurement they are not, and the estimated
bounds yield (large sample) estimates of the biases.

One important thing to note is that the bounds are usually much too wide, and
by making some special assumptions about the errors one can get a range of
consistent estimates of the parameters. We will illustrate this point.

Consider the regression of Y on *, and *, with the data in Table 7.3. Let us 
assume that stock formation is subject to error but *, is not. 7 
In this case we have rj 2 = 0.046. The regression of Y on *, and * 2 gives 

Y = 0.19!*, + 0.405*2 - 16.78 R 2 = 0.972 

(0.0087) (0.319) 

d 2 = 155.8 

The rtveitte Degression gives (note we use *2 as the regressand) 

*2 = 0.239T - 0.040*, + 6.07 R 2 = 0.139 

, We will apply the formulas in (11.6) and (11.7) except that we have to interchange the sub¬ 
scripts 1 and 2. - . • 
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which, when normalized with respect to Y, gives < 

Y = 0.169*, + 4.184*, - 25.35 
We thus get the bounds 

0.169 < 0, <0.191 and 0.405 < p, < 4.184 

The bounds for $\beta_2$ are very wide and we can learn more by studying the
first equation in (11.4).⁸ Suppose that $\lambda = 0.477$, that is, the error
variance is 47.7% of the variance of $X_2$, which is a very generous
assumption. In this case, a consistent estimate of $\beta_2$ is
2 × 0.405 = 0.81, which is far below the upper bound of 4.184. If
$\lambda = 0$, that is, there is no error in $X_2$, a consistent estimate of
$\beta_2$ is, of course, 0.405. Since $(1 - \rho^2) = 0.954$ in this case, we
have to have $\lambda = 0.86$, or the error variance about 86% of the variance
of $X_2$, to say that a consistent estimate of $\beta_2$ is 4.184.
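
The arithmetic in this paragraph can be checked with a short sketch. It assumes
that the relevant relation is plim $\hat\beta_2 = \beta_2[1 - \lambda/(1 - \rho^2)]$,
which is what the numbers quoted above imply; the function names are
illustrative and not from the text.

```python
def consistent_beta(ols_estimate, lam, rho_sq):
    """Undo the attenuation implied by plim(b) = beta * (1 - lam / (1 - rho_sq)).

    lam is the assumed share of error variance in the variance of the
    error-ridden regressor; rho_sq is the squared correlation between the
    two observed regressors.
    """
    shrink = 1.0 - lam / (1.0 - rho_sq)
    if shrink <= 0:
        raise ValueError("lam is too large relative to 1 - rho_sq")
    return ols_estimate / shrink


def lam_implied(ols_estimate, target, rho_sq):
    """Error-variance share at which `target` becomes the consistent estimate."""
    return (1.0 - ols_estimate / target) * (1.0 - rho_sq)


rho_sq = 1.0 - 0.954  # the text reports 1 - rho^2 = 0.954
for lam in (0.0, 0.2, 0.477):
    print(lam, round(consistent_beta(0.405, lam, rho_sq), 3))
print(round(lam_implied(0.405, 4.184, rho_sq), 2))
```

Running the loop over a few plausible values of $\lambda$ gives the kind of
range of consistent estimates that the text recommends in place of the very
wide bounds.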

Thus, in many problems, any analysis of bounds should be supplemented by
some estimates based on some plausible assumptions of error variances,
especially when the bounds are very wide. Many of the comments made here
regarding the use of bounds in errors in variables models apply to more general
models. The purpose of analyzing a specialized model is to show some of the
uses and shortcomings of the analysis in terms of bounds, and these would not
be as transparent when we consider a k-variable model with all variables
measured with error.⁹


Two Explanatory Variables: Both Measured with Error 

Consider the case where both $x_1$ and $x_2$ are measured with error. Let the
observed variables be

$$Y = y + v \qquad X_1 = x_1 + u_1 \qquad X_2 = x_2 + u_2$$

We continue to make the same assumptions as before, that is, $u_1$, $u_2$, $v$
are mutually uncorrelated and uncorrelated with $x_1$, $x_2$, $y$. As before, we
use the normalization var($X_1$) = 1, var($X_2$) = 1, and
cov($X_1$, $X_2$) = $\rho$. Define

$$\lambda_1 = \frac{\mathrm{var}(u_1)}{\mathrm{var}(X_1)} \qquad
\lambda_2 = \frac{\mathrm{var}(u_2)}{\mathrm{var}(X_2)} \qquad
\sigma^2 = \mathrm{var}(e + v)$$

We have the equation in observable variables:

$$Y = \beta_1X_1 + \beta_2X_2 + w$$

where

$$w = e + v - \beta_1u_1 - \beta_2u_2$$

Thus

$$\mathrm{cov}(X_1, w) = -\beta_1\lambda_1$$
$$\mathrm{cov}(X_2, w) = -\beta_2\lambda_2$$

We thus have

$$\sigma_{1y} = \mathrm{cov}(Y, X_1) = \beta_1 + \beta_2\rho - \beta_1\lambda_1$$
$$\sigma_{2y} = \mathrm{cov}(Y, X_2) = \beta_1\rho + \beta_2 - \beta_2\lambda_2 \qquad (11.8)$$
$$\sigma_{yy} = \mathrm{var}(Y) = \beta_1^2 + \beta_2^2 + 2\rho\beta_1\beta_2 + \sigma^2 - \beta_1^2\lambda_1 - \beta_2^2\lambda_2$$

⁸ Note that since it is $X_2$ that we are assuming to have an error, we have to
interchange the subscripts 1 and 2.

⁹ The case of only one variable measured with error has also been considered by
M. D. Levi, “Measurement Errors and Bounded OLS Estimates,” Journal of
Econometrics, Vol. 6, 1977, pp. 165-171. However, considering the two variable
case discussed here, it is easy to see that the formulae he gives depend on the
variances and covariance of the true variables $x_1$ and $x_2$, and
$\sigma_u^2$. We have presented the results in terms of the sample variances
and covariance of $X_1$ and $X_2$, as done by Griliches and Ringstad. In this
case all we have to do is to make some assumption about $\lambda$, the
proportion of error variance to total variance in $X_1$, to derive the bounds.

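As before, we can derive the probability limits of $\hat\beta_1$ and
$\hat\beta_2$, the least squares estimators of $\beta_1$ and $\beta_2$. We get

$$\mathrm{plim}\ \hat\beta_1 = \beta_1 - \frac{\beta_1\lambda_1 - \rho\beta_2\lambda_2}{1 - \rho^2}$$
$$\mathrm{plim}\ \hat\beta_2 = \beta_2 - \frac{\beta_2\lambda_2 - \rho\beta_1\lambda_1}{1 - \rho^2} \qquad (11.9)$$

These equations correspond to (11.4) considered earlier (which were for
$\lambda_2 = 0$).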

We will not present the detailed calculations here (because they are tedious),
but the equations corresponding to (11.5), obtained from a regression of $X_1$
on Y and $X_2$ and normalizing with respect to Y, are

$$\mathrm{plim}\ \tilde\beta_1 = \beta_1 + \frac{1}{\Delta}\left[\beta_2\lambda_2(\beta_2 + \rho\beta_1 - \beta_2\lambda_2) + \sigma^2\right]$$
$$\mathrm{plim}\ \tilde\beta_2 = \beta_2 - \frac{1}{\Delta}\left[\beta_2\lambda_2(\beta_1 + \rho\beta_2 - \beta_1\lambda_1) + \rho\sigma^2\right] \qquad (11.10)$$

where

$$\Delta = \beta_1(1 - \rho^2 - \lambda_1) + \beta_2\lambda_2\rho = \beta_1(1 - \rho^2) - (\beta_1\lambda_1 - \rho\beta_2\lambda_2)$$

[Substituting $\lambda_2 = 0$, we get the results in (11.5).]

If we instead obtain estimators of $\beta_1$ and $\beta_2$ from a regression of
$X_2$ on $X_1$ and Y, their probability limits are obtained by just
interchanging the subscripts 1 and 2 in equations (11.10).

It is easy to see from equations (11.9) and (11.10) that the directions of the
biases are rather difficult to evaluate. Further, we saw earlier in the case of
equations (11.4) and (11.5) that the derived bounds are in many cases rather
too wide to be of any practical use and it is better to use equations (11.4) to
get a range of consistent estimates based on different assumptions about
$\lambda$. The problem with equations (11.5) or (11.10) is that they depend on a
further unknown parameter $\sigma^2$.
In this case, we cannot get any bounds on the parameters using the procedures
of calculating probability limits as in equations (11.9) and (11.10). There is
an alternative procedure due to Klepper and Leamer¹⁰ but its discussion is
beyond our scope.

Here we will illustrate the range of consistent estimates of $\beta_1$ and
$\beta_2$ using equations (11.8) or (11.9) with different assumptions about
$\lambda_1$ and $\lambda_2$. This method is useful when we have some rough idea
about the range of possible values for $\lambda_1$ and $\lambda_2$. There is the
problem, however, that not all these estimates are valid, because the implied
estimate of $\sigma^2$ obtained from the third equation in (11.8) should be
positive and also
var($x_1$) var($x_2$) - [cov($x_1$, $x_2$)]$^2$ > 0. This last condition implies
that $(1 - \lambda_1)(1 - \lambda_2) - \rho^2 > 0$. Using equations (11.9) we
obtain consistent estimates $\beta_1^*$ and $\beta_2^*$ of $\beta_1$ and
$\beta_2$ by solving the equations

$$(1 - \rho^2 - \lambda_1)\beta_1^* + \rho\lambda_2\beta_2^* = (1 - \rho^2)\hat\beta_1 \qquad (11.11)$$
$$\rho\lambda_1\beta_1^* + (1 - \rho^2 - \lambda_2)\beta_2^* = (1 - \rho^2)\hat\beta_2$$

The estimate of $\sigma^2$ is obtained from the last equation in (11.8). It is

$$\hat\sigma^2 = \hat\sigma_{yy} - (\beta_1^{*2} + \beta_2^{*2} + 2\rho\beta_1^*\beta_2^* - \lambda_1\beta_1^{*2} - \lambda_2\beta_2^{*2}) \qquad (11.12)$$

As an example, let us consider the data in Table 7.3 and a regression of Y on
$X_1$ and $X_2$. The correlation coefficient between $X_1$ and $X_2$ is 0.2156.
Hence $\rho = 0.2156$. Also, $\hat\sigma_{1y} = 11.9378$ and
$\hat\sigma_{2y} = 3.2286$.

It is reasonable to assume that $X_1$ (gross domestic production) has smaller
measurement errors than $X_2$ (stock formation). Hence we consider

$$\lambda_1 = 0.01, 0.02, 0.05, 0.10$$
$$\lambda_2 = 0.10, 0.20, 0.30, 0.40$$

Note that for all these values, the condition
$(1 - \lambda_1)(1 - \lambda_2) - \rho^2 > 0$ is satisfied. The results of
solving equations (11.11) are presented in Table 11.1. The range of consistent
estimates of $\beta_1$ is (0.128, 0.201) and the range of consistent estimates
of $\beta_2$ is (0.447, 0.697). Note that the OLS estimates were
$\hat\beta_1 = 0.191$ and $\hat\beta_2 = 0.405$. Thus with the assumptions about
error variances we have made, the OLS estimate $\hat\beta_2$ is not within the
consistent bounds for $\beta_2$ that we have obtained.
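
The grid calculation behind Table 11.1 can be sketched as follows. This is only
an illustration of solving the two linear equations (11.11) by Cramer's rule;
the function names are hypothetical, and because the OLS inputs used here are
the rounded values 0.191 and 0.405, the output can differ from the published
table in the third decimal place.

```python
def solve_11_11(b1_ols, b2_ols, rho, lam1, lam2):
    """Solve equations (11.11) for the consistent estimates (beta1*, beta2*)."""
    k = 1.0 - rho ** 2
    a11, a12 = k - lam1, rho * lam2
    a21, a22 = rho * lam1, k - lam2
    r1, r2 = k * b1_ols, k * b2_ols
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("equations (11.11) are singular for these values")
    beta1_star = (r1 * a22 - a12 * r2) / det
    beta2_star = (a11 * r2 - a21 * r1) / det
    return beta1_star, beta2_star


def admissible(rho, lam1, lam2):
    """Check the condition (1 - lam1)(1 - lam2) - rho^2 > 0 from the text."""
    return (1.0 - lam1) * (1.0 - lam2) - rho ** 2 > 0


RHO = 0.2156
for lam1 in (0.01, 0.02, 0.05, 0.10):
    for lam2 in (0.10, 0.20, 0.30, 0.40):
        if admissible(RHO, lam1, lam2):
            b1, b2 = solve_11_11(0.191, 0.405, RHO, lam1, lam2)
            print(f"{lam1:.2f}  {lam2:.2f}  {b1:.4f}  {b2:.4f}")
```

Setting both error shares to zero returns the OLS estimates unchanged, which is
a quick sanity check on the solver.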

Note that assuming only $X_2$ (and not $X_1$) was measured with error we
obtained earlier the bounds

$$0.169 < \beta_1 < 0.191 \quad\text{and}\quad 0.405 < \beta_2 < 4.184$$

The OLS estimates are at the extremes of these bounds. With the assumptions
about error variances we have made, the bounds for $\beta_1$ are wider but the
bounds for $\beta_2$ are much narrower.

Table 11.1 Estimates of Regression Parameters and Error Variance Based on
Different Assumptions About Error Variances

$\lambda_1$   $\lambda_2$   $\beta_1^*$   $\beta_2^*$   $\hat\sigma^2$
   0.01          0.10         0.1817        0.4519         155.5
   0.01          0.20         0.1687        0.5119         155.5
   0.01          0.30         0.1516        0.5903         155.5
   0.01          0.40         0.1284        0.6970         155.4
   0.02          0.10         0.1837        0.4514         155.5
   0.02          0.20         0.1705        0.5114         155.5
   0.02          0.30         0.1533        0.5898         155.5
   0.02          0.40         0.1298        0.6965         155.4
   0.05          0.10         0.1898        0.4500         155.5
   0.05          0.20         0.1762        0.5099         155.5
   0.05          0.30         0.1585        0.5882         155.5
   0.05          0.40         0.1343        0.6949         155.4
   0.10          0.10         0.2010        0.4473         155.5
   0.10          0.20         0.1867        0.5071         155.5
   0.10          0.30         0.1680        0.5853         155.5
   0.10          0.40         0.1424        0.6920         155.4

The important conclusion that emerges from all this discussion and illustrative
calculations is that in errors in variables models, making the most general
assumptions about error variances leads to very wide bounds for the parameters,
thus making no inference possible. On the other hand, making some plausible
assumptions about the variances of the errors in the different variables, one
could get more reasonable bounds for the parameters. A sensitivity analysis
based on reasonable assumptions about error variances would be more helpful
than obtaining bounds on very general assumptions.

¹⁰ S. Klepper and E. E. Leamer, “Consistent Sets of Estimates for Regressions
with Errors in All Variables,” Econometrica, Vol. 52, 1984, pp. 163-183.


11.4 Reverse Regression 


In Sections 11.2 and 11.3 we considered two types of regressions. When we 
have the variables y and x both measured with error (the observed values being 
Y and X), we consider two regression equations: 

1. Regression of Y on X, which is called the “direct” regression.

2. Regression of X on Y, which is called the “reverse” regression.

Reverse regression has been frequently advocated in the case of analysis of
salary discrimination.¹¹ Since the problem is one of the usual errors in
variables, both regressions need to be computed, and whether reverse regression
alone gives the correct estimates depends on the assumptions one makes.

¹¹ There is a large amount of literature on this issue, many papers arguing in
favor of “reverse regression.” For a survey and different alternative models,
see A. S. Goldberger, “Reverse Regression and Salary Discrimination,” Journal
of Human Resources, Vol. 19, No. 3, 1984, pp. 293-318.

The usual model, in its simplest form, is that of two explanatory variables,
one of which is measured with error.

$$y = \beta_1x_1 + \beta_2x_2 + u \qquad (11.13)$$

where    y = salary
     $x_1$ = true qualifications
     $x_2$ = gender (in sex discrimination)
           = race (in race discrimination)

What we are interested in is the coefficient of $x_2$. The problem is that
$x_1$ is measured with error. Let

     $X_1$ = measured qualifications
     $X_1 = x_1 + v$

Suppose we adopt the notation that

$$x_2 = \begin{cases} 1 & \text{for men} \\ 0 & \text{for women} \end{cases}$$

Then $\beta_2 > 0$ implies that men are paid more than women with the same
qualifications and thus there is sex discrimination. A direct least squares
estimation of equation (11.13) with $X_1$ substituted for $x_1$ and
$\hat\beta_2 > 0$ has been frequently used as evidence of sex discrimination.

In the reverse regression

$$X_1 = \gamma_1y + \gamma_2x_2 + w \qquad (11.14)$$

we are asking whether men are more or less qualified than women having the
same salaries. The proponents of the reverse regression argue that to establish
discrimination, one has to have $\gamma_2 < 0$ in equation (11.14); that is,
among men and women receiving equal salaries, the men possess lower
qualifications.

The evidence from the reverse regression has been mixed. In some cases
$\hat\gamma_2 < 0$ but not significant and in some others $\hat\gamma_2 > 0$.
Conway and Roberts¹² consider data for 274 employees of a Chicago bank in 1976.
In their analysis y = log salary and they get $\hat\beta_2 = 0.148$ (standard
error 0.036), thus indicating that men are overpaid by about 15% compared with
women with the same qualifications. In the reverse regression they get
$\hat\gamma_2 = -0.0097$ (standard error 0.0202), thus showing no evidence of
discrimination one way or the other. In another study by Abowd, Abowd, and
Killingsworth,¹³ who compare wages for whites and several ethnic groups, the
direct regression gave $\hat\beta_2 > 0$ and the reverse regression gave
$\hat\gamma_2 > 0$ (indicating that whites are disfavored). Thus the direct
regression showed discrimination and the reverse regression showed reverse
discrimination.

¹² Delores A. Conway and Harry V. Roberts, “Reverse Regression, Fairness and
Employment Discrimination,” Journal of Business and Economic Statistics,
Vol. 1, January 1983, pp. 75-85.

¹³ A. M. Abowd, J. M. Abowd, and M. R. Killingsworth, “Race, Spanish Origin and
Earnings Differentials Among Men: The Demise of Two Stylized Facts,” NORC
Discussion Paper 83-11, The University of Chicago, Chicago, May 1983.

Of course, in all these studies, there is no single measure of qualifications
and $X_1$ is a set of variables rather than a single variable. In this case
what they do in reverse regression is take the estimated coefficients from the
direct regression, form the linear combination of these variables based on
these estimated coefficients, use it as the dependent variable, and regress it
on y and $x_2$. Since the direct regression gives biased estimates of these
coefficients, what we have here is a biased index of qualifications.

The usual errors in variables results in equations (11.4) and (11.5) show that
one should not make inferences on the basis of $\hat\beta_2$ and
$\hat\gamma_2$ but obtain bounds for $\beta_2$ from the direct regression and
reverse regression estimates. As shown in equation (11.7), these bounds depend
on the sign of $\rho\beta_1$, where $\rho$ = correlation between $X_1$ and
$x_2$. Normally, one would expect $\rho > 0$ and $\beta_1 > 0$ and hence we
have

$$\mathrm{plim}\ \tilde\beta_2 < \beta_2 < \mathrm{plim}\ \hat\beta_2$$

where $\tilde\beta_2$ is the implied estimate of $\beta_2$ from the reverse
regression.

Note also from equations (11.4) and (11.5) that the (asymptotic) biases depend
on two factors: $\lambda$ = var($v$)/var($X_1$), the share of measurement error
variance in the variance of $X_1$, and $\rho$ = correlation between $X_1$ and
$x_2$. $\lambda$ is unknown but $\rho$ can be computed from the data. Thus one
can generate different estimates of $\beta_2$ from these equations based on
different assumptions about the value of $\lambda$. We will not, however,
undertake this exercise here.
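
A small numerical sketch may help to show how the direct and reverse
regressions bracket $\beta_2$. It works with population moments rather than
data, under the normalization var($X_1$) = var($x_2$) = 1 used earlier in the
chapter; the parameter values and the function name are made up for
illustration and are not taken from any of the studies cited.

```python
def direct_and_reverse_beta2(beta1, beta2, rho, lam, sigma2):
    """Large-sample values of the direct and reverse estimates of beta2.

    Assumptions (hypothetical setup): var(X1) = var(x2) = 1,
    cov(X1, x2) = rho, only X1 is error-ridden with error share lam,
    and the equation error has variance sigma2.
    """
    s1y = beta1 * (1.0 - lam) + beta2 * rho           # cov(Y, X1)
    s2y = beta1 * rho + beta2                         # cov(Y, x2)
    syy = (beta1 ** 2 * (1.0 - lam) + beta2 ** 2
           + 2.0 * rho * beta1 * beta2 + sigma2)      # var(Y)
    # Direct regression of Y on X1 and x2: coefficient of x2.
    direct = (s2y - rho * s1y) / (1.0 - rho ** 2)
    # Reverse regression of X1 on Y and x2, then normalized on Y.
    denom = syy - s2y ** 2
    g1 = (s1y - rho * s2y) / denom                    # coefficient of Y
    g2 = (rho * syy - s1y * s2y) / denom              # coefficient of x2
    reverse = -g2 / g1
    return direct, reverse


direct, reverse = direct_and_reverse_beta2(
    beta1=1.0, beta2=0.1, rho=0.3, lam=0.2, sigma2=0.5)
print(round(reverse, 3), "< 0.1 <", round(direct, 3))
```

In this made-up case the true $\beta_2$ is positive, the direct regression
overstates it, and the reverse regression gives an implied estimate of the
opposite sign, which is the pattern of conflicting evidence described above.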

11.5 Instrumental Variable Methods 


Consider equation (11.2). The reason we cannot use OLS is that the error $w_i$
is correlated with $x_i$. The instrumental variable method consists of finding
a variable $z_i$ that is uncorrelated with $w_i$ but correlated with $x_i$, and
estimating $\beta$ by $\hat\beta_{IV} = \sum y_iz_i / \sum x_iz_i$. The variable
$z_i$ is called an “instrumental variable.”

Note that in the usual regression model $y = \beta x + u$ the normal equation
for the OLS estimation of $\beta$ is

$$\sum x(y - \hat\beta x) = 0 \qquad (11.15)$$

This is the sample analog of the assumption we make that cov(x, u) = 0. If this
assumption is violated, we cannot use the normal equation (11.15). However,
if we have a variable z such that cov(z, u) = 0, we replace the normal equation
(11.15) by

$$\sum z(y - \hat\beta x) = 0$$

which gives the instrumental variable (IV) estimator. We can show that the IV
estimator is consistent.

$$\mathrm{plim}\ \hat\beta_{IV} = \mathrm{plim}\ \frac{\sum zy}{\sum zx}
= \beta + \mathrm{plim}\left[\frac{(1/n)\sum zu}{(1/n)\sum zx}\right]
= \beta + \frac{\mathrm{cov}(z, u)}{\mathrm{cov}(z, x)} = \beta$$

since cov(z, u) = 0 and cov(z, x) ≠ 0. These are the two requirements on z: it
must be uncorrelated with u, so that the numerator vanishes, and correlated
with x, so that the denominator does not. It is often suggested that z is a
“good” instrument if it is highly correlated with x.
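
The consistency argument can be illustrated with a small simulation. This is a
sketch under assumed conditions, not an example from the text: the true
regressor, the measurement error, the instrument noise, and the equation error
are all drawn as independent standard normals, and the true slope is set to 2.

```python
import random


def ols_and_iv(n=20000, beta=2.0, seed=0):
    """Simulate an error-ridden regressor and return (OLS, IV) slope estimates."""
    rng = random.Random(seed)
    sxy = sxx = szy = szx = 0.0
    for _ in range(n):
        x_true = rng.gauss(0.0, 1.0)
        x_obs = x_true + rng.gauss(0.0, 1.0)   # regressor measured with error
        z = x_true + rng.gauss(0.0, 1.0)       # instrument: related to x_true only
        y = beta * x_true + rng.gauss(0.0, 1.0)
        sxy += x_obs * y
        sxx += x_obs * x_obs
        szy += z * y
        szx += z * x_obs
    ols = sxy / sxx        # solves sum x (y - b x) = 0
    iv = szy / szx         # solves sum z (y - b x) = 0
    return ols, iv


ols, iv = ols_and_iv()
print("OLS:", round(ols, 2), " IV:", round(iv, 2))
```

With these variances the OLS estimate should settle near half the true slope,
while the IV estimate should be close to the true value, up to sampling error.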

In practice it is rather hard to find valid instrumental variables. Usually,
the instrumental variables are some variables that are “around,” that is, whose
data are available but do not belong in the equation. An illustration of this
is the study by Griliches and Mason¹⁴ who estimate an earnings function of the
form

$$y = \alpha + \beta s + \gamma a + \delta x + u$$

where y is the log wages, s the schooling, a the ability, x the other
variables, and u the error term. They substituted an observed test score t for
the unobserved ability variable and assumed that it was measured with a random
error. They then used a set of instrumental variables such as parental status,
regions of origin, and so on. The crucial assumption is that these variables do
not belong in the earnings function explicitly, that is, they enter only
through their influence on the ability variable. Some such assumption is often
needed to justify the use of instrumental variables.

In the Griliches-Mason study the OLS estimation (the coefficients of the
“other variables” x are not reported) gave

y = const. + 0.1982(race) + 0.0331(schooling) + 0.00298(test score)
             (0.0458)       (0.0067)            (0.00038)

(Figures in parentheses are standard errors.) The IV estimation gave

y = const. + 0.0730(race) + 0.0483(schooling) + 0.00889(test score)
             (0.0468)       (0.0065)            (0.00078)

The IV estimation thus gave a much higher estimate of the ability coefficient
and a lower estimate of the race coefficient.

In the case of time series data, lagged values of the measured $X_t$ are often
used as instrumental variables. If $X_t = x_t + u_t$ and the measurement errors
$u_t$ are serially uncorrelated but the true $x_t$ are serially correlated,
then $X_{t-1}$ can be used as an instrumental variable. This suggestion was
made as early as 1941 by Reiersol.¹⁵

¹⁴ Z. Griliches and W. M. Mason, “Education, Income and Ability,” Journal of
Political Economy, Vol. 80, No. 3, Part 2, May 1972, pp. S74-S103.

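As an illustration, consider the model

$$Y_t = \beta x_t + e_t$$
$$X_t = x_t + u_t$$
$$x_t = \rho x_{t-1} + v_t$$

(There is no error of observation in $Y_t$.) Then

$$\mathrm{var}(X_t) = \sigma_{xx} + \sigma_{uu}$$
$$\mathrm{var}(Y_t) = \beta^2\sigma_{xx} + \sigma_{ee}$$
$$\mathrm{cov}(X_t, Y_t) = \beta\sigma_{xx}$$
$$\mathrm{cov}(X_t, X_{t-1}) = \rho\sigma_{xx}$$
$$\mathrm{cov}(Y_t, Y_{t-1}) = \beta^2\rho\sigma_{xx}$$
$$\mathrm{cov}(Y_t, X_{t-1}) = \mathrm{cov}(X_t, Y_{t-1}) = \beta\rho\sigma_{xx}$$

Thus we can estimate $\beta$ by

$$\hat\beta = \frac{\mathrm{cov}(Y_t, X_{t-1})}{\mathrm{cov}(X_t, X_{t-1})}
\quad\text{or}\quad
\frac{\mathrm{cov}(Y_t, Y_{t-1})}{\mathrm{cov}(X_t, Y_{t-1})}$$

The former amounts to using $X_{t-1}$ as the instrumental variable, and the
latter amounts to using $Y_{t-1}$ as the instrumental variable.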
One other instrumental variable method is the method of grouping. Three main
grouping methods have been suggested in the literature, by Wald, Bartlett, and
Durbin.¹⁶ In Wald's method we rank the X's and form those above the median X
into one group and those below the median into another group. If the means in
the two groups are, respectively, $\bar{Y}_1$, $\bar{X}_1$ and $\bar{Y}_2$,
$\bar{X}_2$, we estimate the slope $\beta$ by

$$\beta^* = (\bar{Y}_2 - \bar{Y}_1)/(\bar{X}_2 - \bar{X}_1)$$

This amounts to using the instrumental variable

$$Z_i = \begin{cases} 1 & \text{if } X_i > \text{median} \\ -1 & \text{if } X_i < \text{median} \end{cases}$$

and using the estimator $\beta^* = \sum Z_iY_i / \sum Z_iX_i$. Bartlett
suggested ranking X's, forming three groups, and discarding the n/3
observations in the middle group. His estimator of $\beta$ is

$$\beta^* = \frac{\bar{Y}_3 - \bar{Y}_1}{\bar{X}_3 - \bar{X}_1}$$

This amounts to using the instrumental variable

$$Z_i = \begin{cases} 1 & \text{for the top } n/3 \text{ observations} \\ -1 & \text{for the bottom } n/3 \text{ observations} \end{cases}$$

Durbin suggests using the ranks of $X_i$ as the instrumental variables. Thus

$$\beta^* = \frac{\sum iY_i}{\sum iX_i}$$

where the $X_i$ are ranked in ascending order and the $Y_i$ are the values of Y
corresponding to the $X_i$. If the errors are large, the ranks will be
correlated with the errors and the estimators given by Durbin's procedure will
be inconsistent. Since the estimators given by Wald's procedure and Bartlett's
procedure also depend on the ranking of the observations, these estimators can
also be expected to be inconsistent. Pakes¹⁷ investigates the consistency
properties of the grouping estimators in great detail and concludes that except
in very few cases they are inconsistent, quite contrary to the usual
presumption that they are consistent (although inefficient).

Thus the grouping estimators are not of much use in errors-in-variables
models.

¹⁵ O. Reiersol, “Confluence Analysis by Means of Lag Moments and Other Methods
of Confluence Analysis,” Econometrica, 1941, pp. 1-24.

¹⁶ A. Wald, “The Fitting of Straight Lines If Both Variables Are Subject to
Errors,” Annals of Mathematical Statistics, 1940, pp. 284-300; M. S. Bartlett,
“Fitting of Straight Lines When Both Variables Are Subject to Error,”
Biometrics, 1949, pp. 207-212; J. Durbin, “Errors in Variables,” Review of
International Statistical Institute, 1954, pp. 23-32.
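
For readers who want to see the mechanics, the following sketch computes Wald's
two-group estimator as described above. It only illustrates the calculation;
as the text notes, the estimator is generally inconsistent when the ranking is
based on error-ridden X's. The function name and the toy data are hypothetical.

```python
from statistics import mean, median


def wald_slope(xs, ys):
    """Wald's grouping estimator: difference of group means above/below median X."""
    med = median(xs)
    upper = [(x, y) for x, y in zip(xs, ys) if x > med]
    lower = [(x, y) for x, y in zip(xs, ys) if x < med]
    if not upper or not lower:
        raise ValueError("need observations on both sides of the median")
    x_hi = mean([x for x, _ in upper])
    y_hi = mean([y for _, y in upper])
    x_lo = mean([x for x, _ in lower])
    y_lo = mean([y for _, y in lower])
    return (y_hi - y_lo) / (x_hi - x_lo)


# Toy data lying exactly on y = 1 + 3x, so the slope is recovered exactly.
xs = [1, 2, 3, 4, 5, 6]
ys = [1 + 3 * x for x in xs]
print(wald_slope(xs, ys))
```

Observations equal to the median are dropped here, which matches the
instrument taking the values 1 and -1 only for X above and below the median.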


11.6 Proxy Variables 


Often, the variables we measure are surrogates for the variables we really want 
to measure. It is customary to call the measured variable a "proxy" variable— 
it is a proxy for the true variable. Some commonly used proxies are years of 
schooling for level of education, test scores for ability, and so on. If we treat 
the proxy variable as the true variable with a measurement error satisfying the 
assumptions of the errors in variables model we have been discussing, the anal¬ 
ysis of proxy variables is just the same as the one we have been discussing. 
However, there are some other issues associated with the use of proxy vari¬ 
ables that we need to discuss. 

Sometimes, in multiple regression models it is not the coefficient of the proxy
variable that we are interested in but the other coefficients. An example of
this is the earnings function in Section 11.5, where our interest may be in
estimating the effect of years of schooling on income and not the coefficient
of ability (for which we use the test score as a proxy). In such cases one
question that is often asked is whether it is not better to omit the proxy
variable altogether. To simplify matters, consider the two-regressor case:

$$y = \beta x + \gamma z + u \qquad (11.16)$$

where x is observed but z is unobserved. We make the usual assumption about
x, z and the error term u. Instead of z we observe a proxy p = z + e. As in
the usual errors-in-variables literature, we assume that e is uncorrelated with
z, x and u. Let the population variances and covariances be denoted by
$M_{xx}$, $M_{xz}$, $M_{xp}$, and so on.

¹⁷ Ariel Pakes, “On the Asymptotic Bias of Wald-Type Estimators of a Straight
Line When Both Variables Are Subject to Error,” International Economic Review,
Vol. 23, 1982, pp. 491-497.

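If we omit the variable z altogether from (11.16) and the estimator of $\beta$
is $\hat\beta_{ov}$, then using the omitted-variable formula [Equation (4.16)],
we get

$$\mathrm{plim}\ \hat\beta_{ov} = \beta + \gamma\frac{M_{xz}}{M_{xx}} \qquad (11.17)$$

On the other hand, if we substitute the proxy p for z in equation (11.16) and
the estimator of $\beta$ is $\hat\beta_p$, it can be shown that¹⁸

$$\mathrm{plim}\ \hat\beta_p = \beta + \gamma\frac{M_{xz}}{M_{xx}}\left[\frac{M_{ee}}{M_{ee} + M_{zz}(1 - \rho^2)}\right] \qquad (11.18)$$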


where $\rho$ is the correlation between x and z. The bias due to omission of
the variable z is $\gamma(M_{xz}/M_{xx})$. The bias due to the use of the proxy
p for z is the same multiplied by the ratio in brackets of (11.18). Since this
ratio is less than 1, it follows that the bias is reduced by using the proxy.
McCallum and Wickens have argued on the basis of this result that it is
desirable to use even a poor proxy.
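
The comparison of the two biases can be made concrete with a short sketch. It
evaluates the bias terms in (11.17) and (11.18) for given population moments;
the moment values below are invented for illustration and the function names
are not from the text.

```python
def bias_omitting_z(gamma, m_xz, m_xx):
    """Asymptotic bias in the coefficient of x when z is dropped, as in (11.17)."""
    return gamma * m_xz / m_xx


def bias_using_proxy(gamma, m_xz, m_xx, m_zz, m_ee):
    """Asymptotic bias when the proxy p = z + e replaces z, as in (11.18)."""
    rho_sq = m_xz ** 2 / (m_xx * m_zz)          # squared correlation of x and z
    ratio = m_ee / (m_ee + m_zz * (1.0 - rho_sq))
    return gamma * (m_xz / m_xx) * ratio


# Hypothetical moments: unit variances, cov(x, z) = 0.5.
for m_ee in (0.25, 1.0, 4.0):                   # increasingly poor proxy
    omit = bias_omitting_z(1.0, 0.5, 1.0)
    proxy = bias_using_proxy(1.0, 0.5, 1.0, 1.0, m_ee)
    print(m_ee, round(omit, 4), round(proxy, 4))
```

As the error variance of the proxy grows, the ratio in brackets approaches 1
and the proxy bias approaches the omitted-variable bias from below, which is
the basis of the McCallum and Wickens argument.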

However, there are four major qualifications to this conclusion: 


1. One should look at the variances of the estimators as well and not just at
bias. Aigner¹⁹ studied the mean-square errors of the two estimators
$\hat\beta_{ov}$ and $\hat\beta_p$ and derived a sufficient condition under
which MSE($\hat\beta_p$) ≤ MSE($\hat\beta_{ov}$). This condition is


9 


1 - (1 - Mp 2 X 

tl - (1 - Wp 2 ! 2 * 


>P* 


where 


X = 




o 2 + 


and $\rho$ is defined earlier. From this he concludes that the use of a proxy,
although generally advisable, may not be a superior strategy to dropping
the error-ridden variable altogether. Kinal and Lahiri²⁰ analyze this
problem in greater detail and generality. They give several alternative
expressions to Aigner’s. They conclude that including even a poor proxy is
advisable under a wide range of empirical situations when the alternative
is to discard it altogether.

¹⁸ G. S. Maddala, Econometrics (New York: McGraw-Hill, 1977), p. 160. This
result was derived in B. T. McCallum, “Relative Asymptotic Bias from Errors of
Omission and Measurement,” Econometrica, July 1972, pp. 757-758, and M. R.
Wickens, “A Note on the Use of Proxy Variables,” Econometrica, July 1972,
pp. 365-372.

¹⁹ D. J. Aigner, “MSE Dominance of Least Squares with Errors of Observation,”
Journal of Econometrics, Vol. 2, 1974, pp. 365-372.

2. The second major qualification is that the proxy variable does not always
fall in the pure errors-in-variables case. Usually, the proxy variable is
“some variable” that also depends on the same factors, that is, p is of
the form

$$p = \alpha x + \delta z + e$$

Since z is unobserved and does not have any natural units of measurement,
we will assume that $\delta = 1$. We can then write

$$p = \alpha x + z + e \qquad (11.19)$$

We will also assume that $M_{ue}$ is not zero. Now it does not necessarily
follow that including the proxy p leads to a smaller bias in the estimator
of $\beta$ in equation (11.16).²¹ Thus, except in cases where the proxies fall
in the category of pure errors in variables, it does not follow that using
even a poor proxy is better than using none at all.

3. The third qualification is that the reduction-in-bias argument does not
apply if the proxy variable is a dummy variable.²² This is the case where we
do not observe z but we know when it is in different ranges. For instance,
we do not know how to measure “effective education” but we use dummies
for the amount of education (e.g., grade school, high school, college).
In this case it does not necessarily follow that using the proxy
results in a smaller bias compared with the omission of z altogether.

4. The reduction-in-bias argument also does not apply if the other
explanatory variables are measured with error.²³ In equation (11.16) suppose
that the variable x is measured with error so that what we observe is
X = x + v. We will assume that cov(x, v) = cov(x, z) = cov(v, e) = 0. We
can consider two estimates of $\beta$, one using the proxy p and the other
omitting it. Now we cannot say anything about the direction of the biases,
nor about whether the bias will increase or decrease with the introduction
of p.²⁴

²⁰ T. Kinal and K. Lahiri, “Specification Error Analysis with Stochastic
Regressors,” Econometrica, Vol. 51, 1983, pp. 1209-1219.

²¹ Maddala, Econometrics, p. 161.

²² See Maddala, Econometrics, pp. 161-162, for an example due to D. M. Grether.

²³ See Maddala, Econometrics, pp. 304-305, referring to the results from the
papers by Welch and Griliches on estimation of the effects of schooling on
income.

²⁴ Finis Welch, “Human Capital Theory: Education, Discrimination, and Life
Cycles,” American Economic Review, May 1975, p. 67; Zvi Griliches, “Estimating
the Returns to Schooling: Some Econometric Problems,” Econometrica, Vol. 45,
1977, p. 12.

Coefficient of the Proxy Variable

The preceding discussion referred to a situation where our interest was in the
coefficients of variables other than the proxy variable. There are also many
situations where our interest is in $\gamma$, the coefficient of the unobserved
variable in (11.16). Since z is not observed, we cannot think of any natural
units of measurement for z. Thus it is not the magnitude of $\gamma$ but the
sign of $\gamma$ with which we are concerned. A question we would like to ask
is under what conditions the use of the proxy p will give us the correct sign
for $\gamma$. To answer this question we need a method of combining subjective
assessments of how good the proxies are with objective information on the
observed variables.

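This problem has been analyzed by Krasker and Pratt.²⁵ Consider equation
(11.16). Let the correlation coefficient between the unobserved variable z
and the proxy variable p be $r^*$. The condition that the coefficient of the
proxy variable has the same sign as $\gamma$, the coefficient of the unobserved
variable in (11.16), is

$$(r^*)^2 > R_x^2 + 1 - R_{x,y}^2 \qquad (11.20)$$

where $R_x^2$ is the $R^2$ from a regression of the proxy on the other
explanatory variables and $R_{x,y}^2$ is the $R^2$ from a regression of the
proxy on the other explanatory variables and the dependent variable.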

As an example they consider the determinants of motor vehicle deaths. The
data are cross-section data in 1960 for the 48 contiguous states in the United
States. The variables are:

$y_i$ = logarithm of the number of motor vehicle deaths per
capita in the ith state in 1960

$x_{1i}$ = dummy variable defined to be 1 if the ith state had
mandatory motor vehicle inspection, 0 otherwise

$x_{2i}$ = logarithm of per capita gasoline consumption
in state i in 1960

$x_{3i}$ = logarithm of the fraction of the ith state’s
population that was 18-24 in 1960

$x_{4i}$ = logarithm of the number of automobiles (per capita)
older than 9 years in state i in 1960

The estimated regression equation (with standard errors in parentheses) is

$$\hat{y}_i = -4.53 - 0.23x_{1i} + 1.17x_{2i} + 1.49x_{3i} + 0.04x_{4i}$$

            (1.77)  (0.06)       (0.23)       (0.37)       (0.12)

$x_{2i}$ is considered a proxy variable for an unobserved variable z which is
“per capita exposure to situations that create the possibility of fatal
accidents.” The question is whether the coefficient of the proxy has the same
sign as the coefficient of this unobserved variable z. The $R^2$ from a
regression of $x_{2i}$ on $x_{1i}$, $x_{3i}$, $x_{4i}$ is 0.3895. The $R^2$
from a regression of $x_{2i}$ on $y_i$, $x_{1i}$, $x_{3i}$, $x_{4i}$ is 0.6193.
Hence, according to (11.20), we can be sure that the coefficient of $x_{2i}$
has the same sign as the coefficient of z, regardless of other correlations,
if we are sure that the correlation between $x_{2i}$ and z exceeds

$$(0.3895 + 1 - 0.6193)^{1/2} = 0.878$$

²⁵ W. S. Krasker and J. W. Pratt, “Bounding the Effects of Proxy Variables on
Regression Coefficients,” Econometrica, Vol. 54, 1986, pp. 641-655.

Krasker and Pratt conclude that “To ensure that the signs of the coefficients
coincide with the signs of the unobservable true variables, the proxies must be
of much higher quality than could be hoped for in the actual context.”

Krasker and Pratt also give alternative formulas and methods of determining
the correctness of the sign for the other coefficient $\beta$ in (11.16) as
well. Since these expressions are somewhat complicated, we will not reproduce
them here. The condition given in (11.20) would enable us to judge the sign of
$\gamma$. One reason why they get such a stringent condition is that they relax
the usual assumptions made in the errors-in-variables models. They do not make
the assumption that the error e in p is independent of z, x, or u.

To compute the Krasker-Pratt criterion (11.20) we have to compute the $R^2$'s
from a regression of the proxy on the other explanatory variables, and a
regression of the proxy on the other explanatory variables and the dependent
variable.
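
Once the two $R^2$ values are available, the threshold is a one-line
calculation. The sketch below reproduces the number in the motor vehicle
example; the function name is illustrative.

```python
import math


def krasker_pratt_threshold(r2_on_others, r2_on_others_and_y):
    """Minimum correlation between proxy and true variable implied by (11.20).

    r2_on_others:        R^2 of the proxy on the other explanatory variables
    r2_on_others_and_y:  R^2 of the proxy on those variables plus y
    """
    return math.sqrt(r2_on_others + 1.0 - r2_on_others_and_y)


print(round(krasker_pratt_threshold(0.3895, 0.6193), 3))  # 0.878, as in the text
```

A threshold this close to 1 is what leads to the conclusion that the proxy
would have to be of very high quality for its sign to be trusted.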


11.7 Some Other Problems 


We introduced the simple errors in variables model in Section 11.2 as a starting 
point of our analysis. This simple model may not be applicable with most eco¬ 
nomic data because of the violation of some of the assumptions implied in this 
simple model. The crucial assumptions made in that model are: 

1. The errors have zero mean. 

2. The covariances between the errors and the systematic parts are zero. 

3. The covariances between the errors themselves are zero. 

Based on these assumptions, we showed that the OLS estimator $\hat\beta$ is not
consistent and is biased toward zero. We will now show that:

1. The problem of obtaining consistent estimators can be solved in some
cases if we have more equations in which the same error-ridden variable
occurs.

2. If we consider correlated errors, the least squares estimators need not be
biased toward zero. In fact, the OLS estimator $\hat\beta$ may overestimate
(rather than underestimate) $\beta$.

Thus the conclusions derived in Section 11.2 will not necessarily be correct.

The Case of Multiple Equations

The case of multiple equations can be illustrated as follows. Suppose that

$$y_1 = \beta_1x + e_1$$
$$y_2 = \beta_2x + e_2$$

x is not observed. Instead, we observe X = x + u. Suppose that $e_1$, $e_2$,
and u are mutually uncorrelated and also uncorrelated with x. Also, let
var($e_1$) = $\sigma_1^2$, var($e_2$) = $\sigma_2^2$, var(u) = $\sigma_u^2$,
and var(x) = $\sigma_x^2$. Then we have

$$\mathrm{var}(y_1) = \beta_1^2\sigma_x^2 + \sigma_1^2$$
$$\mathrm{var}(y_2) = \beta_2^2\sigma_x^2 + \sigma_2^2$$
$$\mathrm{cov}(y_1, y_2) = \beta_1\beta_2\sigma_x^2$$
$$\mathrm{cov}(y_1, X) = \beta_1\sigma_x^2$$
$$\mathrm{cov}(y_2, X) = \beta_2\sigma_x^2$$
$$\mathrm{var}(X) = \sigma_x^2 + \sigma_u^2$$

These six equations can be solved to get estimates of $\beta_1$, $\beta_2$,
$\sigma_x^2$, $\sigma_u^2$, $\sigma_1^2$, and $\sigma_2^2$. Specifically, we
have

$$\hat\beta_1 = \frac{\mathrm{cov}(y_1, y_2)}{\mathrm{cov}(X, y_2)}
\quad\text{and}\quad
\hat\beta_2 = \frac{\mathrm{cov}(y_2, y_1)}{\mathrm{cov}(X, y_1)}$$

In effect, this is like using $y_2$ as an instrumental variable in the equation
$y_1 = \beta_1X + w_1$ and $y_1$ as an instrumental variable in the equation
$y_2 = \beta_2X + w_2$. Further elaboration of this approach of solving the
errors-in-variables (and unobservable variables) problem by increasing the
number of equations can be found in Goldberger.²⁶

As an illustration, suppose that 


y, = expenditures on automobiles 


y 2 = expenditures on other durables 


x = permanent income 
X = measured income 

If we are given only $y_1$ and X (or $y_2$ and X), we are in the
single-equation errors-in-variables problem and we cannot get consistent
estimators for $\beta_1$ (or $\beta_2$). But if we are given $y_1$, $y_2$, and
X, we can get consistent estimators for both $\beta_1$ and $\beta_2$.
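
The identification argument can be checked with a short sketch that builds the
six population moments from assumed parameter values and then solves them back.
The parameter values and names are invented for illustration; with real data
the population moments would be replaced by sample moments.

```python
def implied_moments(beta1, beta2, var_x, var_u, var_e1, var_e2):
    """The six second moments of (y1, y2, X) implied by the two-equation model."""
    return {
        "var_y1": beta1 ** 2 * var_x + var_e1,
        "var_y2": beta2 ** 2 * var_x + var_e2,
        "cov_y1_y2": beta1 * beta2 * var_x,
        "cov_y1_X": beta1 * var_x,
        "cov_y2_X": beta2 * var_x,
        "var_X": var_x + var_u,
    }


def solve_parameters(m):
    """Recover all six parameters from the six moments (slopes must be nonzero)."""
    beta1 = m["cov_y1_y2"] / m["cov_y2_X"]     # y2 acts as instrument for X
    beta2 = m["cov_y1_y2"] / m["cov_y1_X"]     # y1 acts as instrument for X
    var_x = m["cov_y1_X"] / beta1
    return {
        "beta1": beta1,
        "beta2": beta2,
        "var_x": var_x,
        "var_u": m["var_X"] - var_x,
        "var_e1": m["var_y1"] - beta1 ** 2 * var_x,
        "var_e2": m["var_y2"] - beta2 ** 2 * var_x,
    }


m = implied_moments(beta1=0.5, beta2=0.3, var_x=4.0, var_u=1.0,
                    var_e1=2.0, var_e2=1.5)
print(solve_parameters(m))
print("single-equation OLS plim for beta1:", m["cov_y1_X"] / m["var_X"])
```

The last line shows the attenuated value that a regression of $y_1$ on X alone
would converge to, for comparison with the recovered $\beta_1$.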


²⁶ A. S. Goldberger, “Structural Equation Methods in the Social Sciences,” Econometrica, 
1972, pp. 979-1002. 
Correlated Errors 


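Until now we have assumed that the errors of observation are mutually 
uncorrelated and also uncorrelated with the systematic parts. If we drop these 
assumptions, things will get more complicated. For example, consider the model 
$y = \beta x + e$. 

The observed values are $X = x + u$ and $Y = y + v$, where $u$ and $v$ are the 
measurement errors. Let $\sigma_{xy}$ denote the covariance between $x$ and $y$, 
with a similar notation for all the other covariances. If the least squares 
estimate of $\beta$ from a regression of $Y$ on $X$ is $\hat{\beta}$, then 

$$
\begin{aligned}
\operatorname{plim} \hat{\beta}
  &= \frac{\operatorname{cov}(Y, X)}{\operatorname{var}(X)}
   = \frac{\operatorname{cov}(x + u,\; y + v)}{\operatorname{var}(x + u)} \\
  &= \frac{\sigma_{xy} + \sigma_{xv} + \sigma_{yu} + \sigma_{uv}}
          {\sigma_{xx} + 2\sigma_{xu} + \sigma_{uu}} \\
  &= \frac{\beta\sigma_{xx} + \sigma_{xv} + \sigma_{yu} + \sigma_{uv}}
          {\sigma_{xx} + 2\sigma_{xu} + \sigma_{uu}}
\end{aligned}
$$

Since $\sigma_{yu} = \beta\sigma_{xu}$ and 
$\sigma_{xv} = \operatorname{cov}[(y - e)/\beta,\; v] = \sigma_{yv}/\beta$, we have 

$$
\operatorname{plim} \hat{\beta}
  = \frac{\beta(\sigma_{xx} + \sigma_{xu}) + \sigma_{yv}/\beta + \sigma_{uv}}
         {\sigma_{xx} + 2\sigma_{xu} + \sigma_{uu}}
$$

Now even if there is no error in $x$ (i.e., $u = 0$), we find that 
$\operatorname{plim} \hat{\beta} \neq \beta$ since $\sigma_{yv} \neq 0$. Thus it is 
not just errors in $x$ that create a problem as in the earlier case. 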

One can calculate the nature of the bias in $\hat{\beta}$ making different 
assumptions about the different covariances. We need not pursue this further 
here. What is important to note is that one can get either underestimation or 
overestimation of $\beta$. With economic data, where such correlations are more the 
rule than the exception, it is important not to believe that the slope 
coefficients are always underestimated in the presence of errors in 
observations, as is suggested by the classical analysis of errors-in-variables 
models. 
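
To see how the direction of the bias depends on the covariances, the probability limit derived above can be evaluated for a few cases. The function below is a direct transcription of that formula; the function name and the numerical values are illustrative assumptions, not taken from the text.

```python
def plim_beta_hat(beta, s_xx, s_xu=0.0, s_uu=0.0, s_yv=0.0, s_uv=0.0):
    """Probability limit of the OLS slope of Y on X with correlated errors.

    Arguments are the true slope and the (co)variances sigma_xx, sigma_xu,
    sigma_uu, sigma_yv, sigma_uv in the notation of the text.
    """
    numerator = beta * (s_xx + s_xu) + s_yv / beta + s_uv
    denominator = s_xx + 2.0 * s_xu + s_uu
    return numerator / denominator

# Classical case: error in x only, uncorrelated with everything else.
classical = plim_beta_hat(beta=1.0, s_xx=4.0, s_uu=1.0)

# No error in x (u = 0), but the error in y is correlated with y itself.
correlated = plim_beta_hat(beta=1.0, s_xx=4.0, s_yv=2.0)

print(classical, correlated)
```

In the first case the limit falls below the true slope of 1, and in the second it lies above it, which is the overestimation the text warns about.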

We have all along omitted the intercept term. If there is an intercept term 
$\alpha$, i.e., our true relationship is $y = \alpha + \beta x + e$, and instead we 
estimate $Y = \alpha + \beta X + w$, then the least squares estimator $\hat{\beta}$ 
underestimates $\beta$ and consequently the least squares estimator $\hat{\alpha}$ 
will overestimate $\alpha$. If, however, the errors do not have a zero mean [i.e., 
$X = x + u$ and $E(u) \neq 0$], these conclusions need not hold. 


Summary 


1. In the single-equation model with a single explanatory variable that is 
measured with error, the least squares estimator of $\beta$ underestimates the true 
$\beta$. Specifically, the bias is $-\beta\lambda$, where $\lambda$ is the proportion 
of the error variance in the variance of $x$. This result is based on the 
assumption that the errors have zero means and have zero covariance with the 
systematic parts and among themselves. 

2. We can obtain bounds for the true coefficient $\beta$ by computing the 
regression coefficient of $y$ on $x$ and the reciprocal of the regression 
coefficient of $x$ on $y$ (Section 11.2). 

3. In a model with two explanatory variables $x_1$ and $x_2$ with coefficients 
$\beta_1$ and $\beta_2$, where only $x_1$ is measured with error, we can show that 
the bias in the estimator of $\beta_1$ is $(-\beta_1\lambda)/(1 - \rho^2)$, where 
$\rho$ is the correlation between the measured values of $x_1$ and $x_2$. Also, the 
bias in the estimator of $\beta_2$ is $-\rho$ times the bias in the estimator of 
$\beta_1$. Similar results can be derived when there are many explanatory 
variables. Some papers in the literature derive the expressions for the bias in 
terms of the correlations between the true unobserved variables. These 
expressions are not very useful in practice. Here we derive the expressions in 
terms of the correlations of the observed variables. The only unknown factor is 
$\lambda$, the proportion of error variance in the variance of the error-ridden 
variable (Section 11.3). 

4. As with the model with a single explanatory variable, we can derive 
bounds for the true coefficients by running two regressions. These bounds are 
given in equations (11.6) and (11.7). However, these bounds are not comparable 
to confidence intervals. The estimated bounds themselves have standard errors. 
We have illustrated with an example that these bounds can sometimes be 
so wide as to be almost useless. In many problems, therefore, it is better to 
supplement them with estimates based on some plausible assumptions about 
the error variances. 

5. In the model with two explanatory variables, if both the variables are 
measured with error, the direction of biases in the OLS estimators cannot be 
easily evaluated [see equations (11.9) and (11.10)]. Making the most general 
assumptions about error variances often leads to wide bounds for the parameters, 
thus making no inference possible. On the other hand, making some plausible 
assumptions about the error variances, one can get more reasonable 
bounds for the parameters. This point is illustrated with an example. 

6. In the application of the errors-in-variables model to problems of 
discrimination, the “reverse regression” has often been advocated. The arguments 
for and against this procedure are reviewed in Section 11.4. 

7. One method for obtaining consistent estimators for the parameters in 
errors-in-variables models is the instrumental variable method. In practice it is 
rather hard to find valid instrumental variables. In time-series data lagged 
values of the measured $x_t$ are often used as instrumental variables. Some grouping 
methods are often suggested for the estimation of errors-in-variables models. 
These methods can be viewed as instrumental variable methods. However, 
their use is not recommended. Except in special cases, the estimators they 
yield are not consistent. 

8. Often in econometric work it is customary to use some surrogate variables 
for the variables we cannot measure. These surrogate variables are called 
proxy variables. In an equation like 


$$y = \beta x + \gamma z + u$$

where we use a proxy $p$ for $z$, a question that has often been asked is whether 
it is better to leave out $p$ completely if we are interested in estimating 
$\beta$. Assuming that $p$ falls in the category of the usual errors in variables, 
it has been shown that the bias in the estimator for $\beta$ is reduced if we leave 
$p$ in the equation. On the basis of this result it has been argued that even a 
poor proxy is better than none. However, this conclusion does not necessarily 
hold good if we take into account the variances of the estimators, or if the 
proxy is a dummy variable, or if the proxy does not fall in the category of the 
usual errors in variables. The conclusion also does not follow if $x$ itself is 
measured with error. 

9. In case we are interested in the coefficient $\gamma$, a question often arises 
whether the use of the proxy $p$ gives us the correct sign. A criterion for 
determining this is given in equation (11.20). 

10. If there are many explanatory variables that depend on the same 
error-ridden variable, sometimes we can use the dependent variables in the other 
equations as instrumental variables for the estimation of the parameters in each 
equation. An example of this is given in Section 11.7. Furthermore, all these 
assumptions about bias are invalid in the presence of correlated errors (see 
Section 11.7). 

Exercises 


1. Explain the meaning of each of the following terms. 

(a) Errors-in-variables bias. 

(b) Bounds for parameters. 

(c) Reverse regression. 

(d) Grouping methods. 

(e) Proxy variables. 

2. Examine whether the following statements are true, false, or uncertain. 
Give a short explanation. If a statement is not true in general but is true 
under some conditions, state the conditions. 

(a) Errors in variables lead to estimates of the regression coefficients that 
are biased toward zero. 

(b) In errors-in-variables models we can always get two estimators 
$\hat{\beta}_1$ and $\hat{\beta}_2$ such that 

$$\operatorname{plim} \hat{\beta}_1 < \beta < \operatorname{plim} \hat{\beta}_2$$

Thus, even though we cannot get a confidence interval, we can get 
bounds for the parameters that will serve the same purpose. 

(c) Grouping methods give consistent (although inefficient) estimators for 
the regression parameters in errors-in-variables models. Since they 
are very easy to compute, they should be used often. 

(d) It is always better to use even a poor proxy than to drop an 
error-ridden variable from an equation. 

(e) If we have an unobserved variable in an equation and we are interested 
in the sign of its coefficient, we should always use a proxy, since 
the estimated coefficient of the proxy variable will give us the correct 
sign. 

(f) In regressions of income on schooling where schooling is measured 
with error, omitting variables that measure ability will overestimate 
the effect of schooling on income. 

(g) In part (f) the bias in the estimator of the effect of schooling on income 
can be reduced if we include a proxy like test score for ability. 

(h) In part (g), if ability is measurable, then we should always include the 
available measure in the earnings function. 

(i) In an analysis of discrimination in salaries an investigator finds that 
the direct regression shows discrimination, whereas the reverse 
regression shows reverse discrimination. This proves that there is no 
evidence of discrimination in salaries. 

3. Consider a regression model 

$$y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + u$$

The variable $x_1$ is not observed, but we use a proxy $p$ for it. Let 
$\hat{\alpha}_1$ be the estimator of $\alpha_1$ obtained from the multiple regression 
equation. If you are told that the correlation between $x_1$ and $p$ is at least 
0.8, explain how you will determine whether $\hat{\alpha}_1$ has the correct sign 
(same sign as $\alpha_1$). 

4. Consider the regression model 

$$y = \beta_1 x_1 + \beta_2 x_2 + u$$

$x_1$ is not observed. The observed value is $X_1 = x_1 + e$, where $e$ is 
uncorrelated with $x_1$, $x_2$, and $u$. Let 
$\gamma = \operatorname{var}(e)/\operatorname{var}(X_1)$. Suppose that we drop $x_2$ 
from the equation. Let the OLS estimator of $\beta_1$ be $\tilde{\beta}_1$. Show that 

$$\operatorname{plim} \tilde{\beta}_1 = \beta_1 - \gamma\beta_1 + \beta_2 b_{21}$$

where $b_{21}$ is the regression coefficient from a regression of $x_2$ on $X_1$. If 
$\hat{\beta}_1$ is the estimator of $\beta_1$ from a regression of $y$ on $X_1$ and 
$x_2$, show that 

$$\operatorname{plim} \hat{\beta}_1 = \beta_1 - \frac{\beta_1 \gamma}{1 - r^2}$$

where $r$ is the correlation between $X_1$ and $x_2$. [This is equation (11.4) 
derived in the text.] 

5. In Exercise 4 compute the two probability limits if the true equation is 

$$y = 1.0x_1 + 0.5x_2 + u$$

$$\gamma = 0.1 \qquad \operatorname{var}(X_1) = \operatorname{var}(x_2) = 9 \qquad r = 0.5$$

What do you conclude from these results? 
12.1 Introduction 


The early developments in econometrics were concerned with the problems of 
estimation of an econometric model once the model was specified. The major 
preoccupation of econometricians was with devising methods of estimation 
that produced consistent and efficient estimates of the parameters. These are 
the methods we discussed in the previous chapters. Sometimes more simplified 
methods like two-stage least squares were suggested because some other methods 
like limited-information maximum likelihood (LIML) were complicated to 
compute. With the recent developments in computer technology, the search 
for such simplified methods of estimation is no longer worthwhile, at least in 
a majority of problems. 

During recent years the attention of econometricians has been diverted to 
the problems of: 

1. Checking the adequacy of the specification of the model. This is called 
“diagnostic checking” and “specification testing.” 

2. Choosing between alternative specifications of the model. This is called 
“model selection.” 

3. Devising methods of estimation based on weaker assumptions about the 
error distributions. This is called “semiparametric estimation.” 

The last area is well beyond the scope of this book and will not be discussed 
at all. The first two areas are also very vast in scope. Hence what we will do 
is to consider only a few of the major developments. 

It is, of course, true that the problem of diagnostic checking has not been 
completely ignored. In fact, we discussed some tests for this purpose in the 
earlier chapters. For instance: 

1. Tests for parameter stability in Chapter 4. 

2. Tests for heteroscedasticity in Chapter 5. 

3. Tests for autocorrelation in Chapter 6. 

However, these tests are all based on least squares residuals, and during 
recent years some alternative residuals have been suggested. Also, tests for 
diagnostic checking have been more systematized (and incorporated in computer 
programs like the SAS regression program). The limitations of some standard 
tests have been noticed and further modifications have been suggested. 
For instance, in Chapter 6 we discussed the limitations of the DW test, which 
has been the most commonly used “diagnostic test” for many years. 

We begin our discussion with diagnostic tests based on the least squares 
residuals. Here we summarize the tests discussed in the earlier chapters and 
discuss some other tests. We will then present some alternatives to least squares 
residuals and the diagnostic tests based on them. Next, we discuss some 
problems in model selection and specification testing. Many of the specification 
tests involve “expanded regressions,” that is, the addition of residuals or some 
other constructed variables as regressors to the original model and testing for 
the significance of the coefficients of these added variables. Thus most of these 
tests can be easily implemented using the standard regression packages. 

12.2 Diagnostic Tests Based on Least 
Squares Residuals 

Diagnostic tests are tests that are meant to “diagnose” some problems with the 
models that we are estimating. The least squares residuals play an important 
role in many diagnostic tests. We have already discussed some of these tests, 
such as tests for parameter stability in Chapter 4, tests for heteroscedasticity 
in Chapter 5, and tests for autocorrelation in Chapter 6. Here we discuss two 
other tests using least squares residuals. 

Tests for Omitted Variables 

Consider the linear regression model 

$$y_t = \beta x_t + u_t$$

To test whether the model is misspecified by the omission of a variable $z_t$, we 
have to estimate the model 

$$y_t = \beta x_t + \gamma z_t + u_t$$

and test the hypothesis $\gamma = 0$. 

If data on $z_t$ are available, there is no problem. All we do is regress $y_t$ on 
$x_t$ and $z_t$ and test whether the coefficient of $z_t$ is zero. Suppose, on the 
other hand, that we use the following procedure: 

1. Regress $y_t$ on $x_t$ and get the residual $\hat{u}_t$. 

2. Regress $\hat{u}_t$ on $z_t$. Let the regression coefficient be $\hat{\gamma}$. 
Test the hypothesis $\gamma = 0$ using this regression equation. 

What is wrong with this procedure? 

The answer is that $\hat{\gamma}$ is an inconsistent estimator of $\gamma$ unless 
$\gamma = 0$. Furthermore, the distribution of $\hat{\gamma}$ is complex and the 
standard errors provided by the least squares estimation of step 2 will not be 
the correct ones. If the least squares residuals at step 1 are to be used, we 
should regress them on $z_t$ and $x_t$ and then test whether the coefficient of 
$z_t$ is zero.¹ 

Thus if we are to use the residuals $\hat{u}_t$ from a regression of $y_t$ on $x_t$ 
for testing specification errors caused by omitted variables, it is advisable to 
regress $\hat{u}_t$ on $z_t$ and $x_t$, and not on $z_t$ only, and test the hypothesis 
that the coefficient of $z_t$ is zero. 
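
The difference between the two residual regressions can be illustrated numerically. The sketch below uses simulated data with assumed coefficients (1.0 on $x_t$, 0.5 on $z_t$) and an assumed correlation between $x_t$ and $z_t$; none of these values come from the text. It compares the coefficient of $z_t$ from the full regression with the coefficients obtained by regressing $\hat{u}_t$ on $z_t$ alone and on $z_t$ and $x_t$ together.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)          # z is correlated with x (assumed)
y = 1.0 * x + 0.5 * z + rng.normal(size=n)

def ols(X, y):
    """Least squares coefficients for a regression of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Step 1: regress y on x only (no constant term, as in the text) and keep residuals.
b = ols(x[:, None], y)
u_hat = y - x * b[0]

gamma_full = ols(np.column_stack([x, z]), y)[1]       # y on x and z
gamma_wrong = ols(z[:, None], u_hat)[0]               # u_hat on z only
gamma_right = ols(np.column_stack([x, z]), u_hat)[1]  # u_hat on x and z

# Share of z not explained by x (uncentered, because there is no constant term).
shrink = 1.0 - (x @ z) ** 2 / ((x @ x) * (z @ z))

print(gamma_full, gamma_wrong, gamma_right, shrink)
```

Regressing the residuals on both $z_t$ and $x_t$ reproduces the coefficient from the full regression, whereas the regression on $z_t$ alone shrinks it by the factor printed last.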


¹ We are omitting the detailed proofs. They can be found in the paper by A. R. Pagan and A. D. 
Hall, “Diagnostic Tests as Residual Analysis” (with discussion), Econometric Reviews, Vol. 2, 
1983, pp. 159-218. 
When observations on $z$ are not available, we use a proxy $\hat{z}$ for $z$. The 
preceding results apply to this case as well.² That is, an appropriate test for 
omitted variables is to estimate the model 

$$y_t = \beta x_t + \gamma \hat{z}_t + v_t$$

and test the hypothesis $\gamma = 0$. Alternatively, if we are using the residuals 
$\hat{u}_t$ from a regression of $y_t$ on $x_t$, we have to regress $\hat{u}_t$ on 
$\hat{z}_t$ and $x_t$ and then test whether the coefficient of $\hat{z}_t$ is zero. Of 
course, we need not go through this circuitous procedure, but the reason for 
discussing this is to make clear the distinction between tests for 
heteroskedasticity discussed in Chapter 5 and tests for omitted variables. 

Ramsey³ suggests the use of $\hat{y}_t^2$, $\hat{y}_t^3$, and $\hat{y}_t^4$ as proxies 
for $z_t$, where $\hat{y}_t$ is the predicted value of $y_t$ from a regression of 
$y_t$ on $x_t$. The test procedure is as follows: 

1. Regress $y_t$ on $x_t$ and get $\hat{y}_t$. 

2. Regress $y_t$ on $x_t$, $\hat{y}_t^2$, $\hat{y}_t^3$, and $\hat{y}_t^4$ and test the 
hypothesis that the coefficients of the powers of $\hat{y}_t$ are zero. 
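
The two steps of Ramsey's omitted-variable test can be sketched as follows. This is a minimal illustration on simulated data, assuming a constant term in the regression and using a standard F-statistic for the joint hypothesis; the helper names `fit` and `reset_f` and the data-generating values are hypothetical.

```python
import numpy as np

def fit(X, y):
    """Return (residual sum of squares, fitted values) from OLS of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    return float(resid @ resid), fitted

def reset_f(x, y):
    """F-statistic for adding the 2nd, 3rd and 4th powers of y_hat."""
    n = len(y)
    X0 = np.column_stack([np.ones(n), x])          # step 1: y on x
    rss0, y_hat = fit(X0, y)
    X1 = np.column_stack([X0, y_hat**2, y_hat**3, y_hat**4])
    rss1, _ = fit(X1, y)                           # step 2: expanded regression
    q = 3                                          # number of added regressors
    return ((rss0 - rss1) / q) / (rss1 / (n - X1.shape[1]))

rng = np.random.default_rng(1)
x = rng.normal(size=500)
e = rng.normal(size=500)

f_linear = reset_f(x, 1.0 + x + e)                  # correctly specified
f_quadratic = reset_f(x, 1.0 + x + 0.5 * x**2 + e)  # x**2 omitted

print(f_linear, f_quadratic)
```

Under the null hypothesis the statistic is compared with an F distribution with 3 and $n - 5$ degrees of freedom in this setup; the omitted quadratic term produces a much larger value.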

Note that this test is slightly different from Ramsey’s test for 
heteroskedasticity discussed in Section 5.2. That test proceeds as follows: 

1. Regress $y_t$ on $x_t$ and get the residual $\hat{u}_t$. 

2. Regress $\hat{u}_t$ on $\hat{y}_t^2$, $\hat{y}_t^3$, and $\hat{y}_t^4$ and test that 
the coefficients of these variables are zero. 

As explained earlier, if we want to use $\hat{u}_t$ to test for omitted variables, 
we have to include $x_t$ as an extra explanatory variable. This is the difference 
between the two Ramsey tests. 

Tests for ARCH Effects 

In econometric models, the uncertainty in the economic relationship is 
captured by the variance $\sigma^2$ of the error term $u_t$. It has been found that 
it is important to model this error variance because it affects the behavior of 
economic agents. One such model is the ARCH model (autoregressive conditionally 
heteroscedastic model) suggested by Engle.⁴ In this model the unconditional 
variance $E(u_t^2)$ is constant but the conditional variance $E(u_t^2 \mid x_t)$ is 
not. Denoting this conditional variance by $\sigma_t^2$, the model suggested by 
Engle is 

$$\sigma_t^2 = \sigma^2 + \gamma u_{t-1}^2 \qquad \gamma > 0$$


² The only problem is that if $\hat{z}$ is uncorrelated with $z$, the test would have low power, but cases 
where $\hat{z}$ is uncorrelated with $z$ are rare. 

³ J. B. Ramsey, “Tests for Specification Errors in Classical Linear Least Squares Regression 
Analysis,” Journal of the Royal Statistical Society, Series B, Vol. 31, 1969, pp. 350-371. 

⁴ R. F. Engle, “Autoregressive Conditional Heteroscedasticity with Estimates of the Variance 
of United Kingdom Inflation,” Econometrica, Vol. 50, 1982, pp. 987-1007. See also Section 
6.11. 
that is, the variance of the current error term $u_t$ is higher if the past error 
term is higher. An unusually high disturbance term in one period results in an 
increase in uncertainty for the next period. If $\gamma = 0$, there is no ARCH 
effect and the usual methods of estimation apply. If $\gamma \neq 0$, we have to use 
more complicated maximum likelihood procedures to estimate the model. 

A test for ARCH effect using the least squares residuals proceeds as follows: 

1. Regress $y_t$ on $x_t$. Obtain $\hat{u}_t$. 

2. Regress $\hat{u}_t^2$ on $\hat{u}_{t-1}^2$ and test whether the regression 
coefficient is zero. 

A large number of studies, particularly those of speculative prices, have 
reported significant ARCH effects. However, one has to be careful in interpreting 
the results because $\hat{u}_{t-1}^2$ might be acting as a proxy for omitted lagged 
values of $y_t$ and $x_t$ from the equation. (Note that 
$\hat{u}_{t-1} = y_{t-1} - \hat{\beta} x_{t-1}$.) Thus the ARCH test should be 
performed after including a sufficient number of lagged values of $y_t$ and $x_t$ 
in the equation.⁵ 
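
The residual-based ARCH test can be sketched on simulated data. The example below is an illustration under assumptions that are not in the text: an ARCH(1) process with $\sigma^2 = 1$ and $\gamma = 0.5$, a regression through the origin with slope 2, and a constant term included in the second-step regression. The helper name `arch_t_ratio` is hypothetical.

```python
import numpy as np

def arch_t_ratio(u_hat):
    """t-ratio on lagged squared residuals in a regression of u_hat[t]**2
    on a constant and u_hat[t-1]**2."""
    u2 = u_hat ** 2
    dep, lag = u2[1:], u2[:-1]
    Z = np.column_stack([np.ones(len(lag)), lag])
    coef, *_ = np.linalg.lstsq(Z, dep, rcond=None)
    resid = dep - Z @ coef
    s2 = resid @ resid / (len(dep) - Z.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[1, 1])
    return coef[1] / se

def ols_residuals(x, y):
    """Residuals from a regression of y on x without a constant term."""
    beta_hat = (x @ y) / (x @ x)
    return y - beta_hat * x

rng = np.random.default_rng(2)
T = 2000
x = rng.normal(size=T)
eps = rng.normal(size=T)

# ARCH(1) errors: conditional variance 1 + 0.5 * u[t-1]**2 (assumed values).
u = np.zeros(T)
for t in range(1, T):
    u[t] = np.sqrt(1.0 + 0.5 * u[t - 1] ** 2) * eps[t]

t_arch = arch_t_ratio(ols_residuals(x, 2.0 * x + u))    # ARCH errors
t_iid = arch_t_ratio(ols_residuals(x, 2.0 * x + eps))   # homoskedastic errors

print(t_arch, t_iid)
```

The t-ratio is large when the errors follow the ARCH process and is of ordinary size when they do not; as the text cautions, a large value can also reflect omitted lags rather than a genuine ARCH effect.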

12.3 Problems with 
Least Squares Residuals 

In our discussion of heteroskedasticity in Chapter 5 and autocorrelation in 
Chapter 6, we considered only the least squares residuals $\hat{u}_i$ obtained from 
the least squares regression. The problem with these residuals, as we will 
demonstrate, is that they are heteroskedastic and autocorrelated even if the true 
errors have a common variance and are serially independent. This 
heteroskedasticity and autocorrelation depend on the particular values of the 
explanatory variables in the sample. That is why the Durbin-Watson test gives an 
inconclusive region. Durbin and Watson obtained lower and upper bounds for their 
test statistic that are valid irrespective of what values the explanatory 
variables take. Presumably, the bounds can be improved with a knowledge of the 
explanatory variables. In fact, we argued that with most economic variables, the 
upper limit is the one to use. An alternative solution is to construct residuals 
that have the same properties as the true errors. That is, if the errors have 
mean zero and constant variance $\sigma^2$ and are serially independent, the 
residuals also exhibit these same properties. One such set of residuals is the 
“recursive residuals,” which are discussed in Section 12.4. First we demonstrate 
the problems with the least squares residuals. 

During recent years separate books have been written on just the topic of 
residuals.⁶ This is because an analysis of residuals is very important for all 
diagnostic tests. 

⁵ The point is similar to the one made in Section 6.9, where it was argued that sometimes the 
observed serial correlation in the residuals could be a consequence of misspecified dynamics 
(i.e., omission of lagged values of $y_t$ and $x_t$ from the equation). 

⁶ For instance, C. Dubbelman, Disturbances in the Linear Model: Estimation and Hypothesis 
Testing (The Hague: Martinus Nijhoff, 1978), and R. D. Cook and S. Weisberg, Residuals and 
Influence in Regression (London: Chapman & Hall, 1982). 
The least squares residuals are obtained from the least squares regression. 
For simplicity of exposition let us assume that there is only one explanatory 
variable and no constant term, so that the regression equation is 

$$y_i = \beta x_i + u_i$$

The least squares residuals are 

$$\hat{u}_i = y_i - \hat{\beta} x_i \qquad \text{where} \qquad \hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}$$

These residuals satisfy the relationship⁷ 

$$\sum x_i \hat{u}_i = 0$$

Defining $S_{xx} = \sum x_i^2$, we can write $\hat{u}_i$ as 

$$\hat{u}_i = y_i - \hat{\beta} x_i = y_i - \frac{1}{S_{xx}} (x_1 y_1 + x_2 y_2 + \cdots + x_n y_n)\, x_i$$

Let us define $h_{ij} = x_i x_j / S_{xx}$. Thus $h_{ij} = h_{ji}$. Then we get⁸ 

$$
\begin{aligned}
\hat{u}_i &= y_i - (h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{in} y_n) \\
          &= (1 - h_{ii})\, y_i - \sum_{j \neq i} h_{ij} y_j
\end{aligned}
\tag{12.1}
$$

Note that 

$$\sum_j h_{ij}^2 = \frac{x_i^2 \sum_j x_j^2}{S_{xx}^2} = \frac{x_i^2}{S_{xx}} = h_{ii} \tag{12.2}$$

From (12.1) we get (since the $y_i$ are independent) 

$$
\begin{aligned}
\operatorname{var}(\hat{u}_i)
  &= (1 - h_{ii})^2 \sigma^2 + \sum_{j \neq i} h_{ij}^2 \sigma^2 \\
  &= (1 - 2h_{ii} + h_{ii}^2)\, \sigma^2 + \sum_{j \neq i} h_{ij}^2 \sigma^2 \\
  &= (1 - 2h_{ii})\, \sigma^2 + \sum_j h_{ij}^2 \sigma^2
\end{aligned}
$$

Hence, using (12.2) we have 

$$\operatorname{var}(\hat{u}_i) = (1 - h_{ii})\, \sigma^2 = \left(1 - \frac{x_i^2}{S_{xx}}\right) \sigma^2 \tag{12.3}$$

⁷ See equation (4.19). We do not have $\sum \hat{u}_i = 0$ because we do not have a constant term in the 
regression equation. 

⁸ The quantities $h_{ij}$ play an important role in the analysis of residuals. The matrix of values $h_{ij}$ 
is called a “hat matrix” because of the relationship $\hat{y}_i = h_{i1} y_1 + h_{i2} y_2 + \cdots + h_{in} y_n$. 


This shows that the least squares residuals are heteroskedastic. The variance 
of $\hat{u}_i$ depends on the particular values of the explanatory variable $x$. 
Also, from (12.1) it is easy to see that the least squares residuals are 
correlated (because of the occurrence of the common $y$’s in the different 
$\hat{u}_i$). 

Thus there are two problems with the least squares residuals: 


1. They are correlated. 

2. They are heteroskedastic. 


This is so even if the true errors $u_i$ are uncorrelated and have a common 
variance $\sigma^2$. The other residuals we will talk about are designed to solve 
these problems. 
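
The algebra of this section can be verified numerically for the one-regressor, no-constant model. The sketch below builds the quantities $h_{ij}$ for a small simulated sample (the sample size, slope, and error variance are assumed values) and forms the implied covariance matrix of the least squares residuals, $\sigma^2(I - H)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, sigma = 8, 1.5, 1.0             # illustrative values (assumed)
x = rng.uniform(1.0, 5.0, size=n)
y = beta * x + sigma * rng.normal(size=n)

S_xx = x @ x
H = np.outer(x, x) / S_xx                # h_ij = x_i * x_j / S_xx
h = np.diag(H)                           # h_ii = x_i**2 / S_xx

u_hat = y - H @ y                        # residuals written as in (12.1)

# Covariance matrix of the residuals when the true errors are i.i.d.:
# diagonal entries are (1 - h_ii) * sigma**2, off-diagonal entries -h_ij * sigma**2.
I = np.eye(n)
resid_cov = sigma**2 * (I - H) @ (I - H).T

print(np.round(np.diag(resid_cov), 3))   # unequal variances
print(np.round(resid_cov[0, 1:4], 3))    # nonzero covariances
```

The diagonal entries differ across observations and the off-diagonal entries are not zero, which are the two problems listed above.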


12.4 Some Other Types of Residuals 


We will discuss four other types of residuals: 

1. Predicted residuals. 

2. Studentized residuals. 

3. BLUS residuals. 

4. Recursive residuals. 

The predicted and studentized residuals both have the same problems as the 
least squares residuals. However, some statisticians have found the predicted 
and studentized residuals useful in choosing between different regression 
models and detection of outliers, respectively. Hence we will discuss them 
briefly. The BLUS and recursive residuals both have the property that they 
have mean zero and constant variance $\sigma^2$ and are serially independent. Thus 
they solve the problems of least squares residuals. However, the BLUS residuals 
are more difficult to compute and have been found to be less useful than 
recursive residuals. Hence we will discuss them only briefly. 

Predicted Residuals 


Suppose that we take sample data of $n$ observations and estimate the regression 
equation with $(n - 1)$ observations at a time by omitting one observation and 
then use this estimated equation to predict the $y$-value for the omitted 
observation. Let us denote the prediction error by $u_i^* = y_i - \hat{y}(i)$. The 
$u_i^*$ are the predicted residuals. 

By $\hat{y}(i)$ we mean a prediction of $y_i$ from a regression equation that is 
estimated from all observations except the $i$th observation. This is in contrast 
to $\hat{y}_i$, which is the predicted value of $y_i$ from a regression equation that 
is estimated using all the observations. 
There is a simple relationship between the least squares residuals $\hat{u}_i$ and 
predicted residuals $u_i^*$. This is⁹ 

$$u_i^* = \frac{\hat{u}_i}{1 - h_{ii}}$$

Since $V(\hat{u}_i) = (1 - h_{ii})\sigma^2$, we have 

$$V(u_i^*) = \frac{\sigma^2}{1 - h_{ii}}$$

Thus the predicted residuals are also heteroskedastic. It has also been proved 
that the predicted residuals have the same correlation structure as the least 
squares residuals. 

Although the predicted residuals have properties similar to the least squares 
residuals, some statisticians have found them more useful than the least 
squares residuals in problems of choosing between different regression 
models.¹⁰ The criterion they use is that of the predicted residual sum of squares 
(PRESS), which is defined as 

$$\text{PRESS} = \sum (u_i^*)^2$$

The more common criterion used is (the rationale for this is discussed in 
Section 12.6) 

$$\text{RSS} = \sum \hat{u}_i^2$$

Since $u_i^* = \hat{u}_i/(1 - h_{ii})$, from the definition of $h_{ii}$ we note that 
PRESS as a criterion for selection of regression models results in a preference 
for models that fit relatively well at remote values of the explanatory variables. 

The predicted residuals can be computed by using the dummy variable 
method described in Chapter 8 (see Section 8.6). We will describe it after 
discussing studentized residuals, because the two are closely related. 
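
The relationship between least squares and predicted residuals can be checked by brute force. The sketch below is an illustration on simulated data with an assumed model (a constant and one regressor); it computes the predicted residuals by actually re-estimating the regression $n$ times, compares them with $\hat{u}_i/(1 - h_{ii})$ using the diagonal of the hat matrix, and then forms PRESS and RSS.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # constant and one regressor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)       # assumed coefficients

H = X @ np.linalg.inv(X.T @ X) @ X.T                    # hat matrix
h = np.diag(H)
u_hat = y - H @ y                                       # least squares residuals

# Predicted residuals: drop observation i, re-estimate, predict y[i].
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    loo[i] = y[i] - X[i] @ b_i

shortcut = u_hat / (1.0 - h)                            # no re-estimation needed

press = float(loo @ loo)
rss = float(u_hat @ u_hat)
print(press, rss)
```

Because each least squares residual is divided by $1 - h_{ii} < 1$, PRESS exceeds RSS, and observations with large $h_{ii}$ (remote values of the explanatory variables) receive the most weight.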


Studentized Residuals 

The studentized residual is just the predicted residual divided by its standard 
error. Thus if we are using the dummy variable method to get the $i$th 
studentized residual, we do the following. Estimate the regression equation with 
an extra variable $D$ defined as 

$$
D = \begin{cases}
1 & \text{for the } i\text{th observation} \\
0 & \text{for all others}
\end{cases}
$$

⁹ We will not be concerned with the proof here. Proofs can be found in Cook and Weisberg, 
Residuals. 

¹⁰ See, for instance, R. L. Anderson, D. M. Allen, and F. Cady, “Selection of Predictor Variables 
in Multiple Linear Regression,” in T. A. Bancroft (ed.), Statistical Papers in Honor of George 
W. Snedecor (Ames, Iowa: Iowa State University Press, 1972). Also N. T. Quan, “The 
Prediction Sum of Squares as a General Measure for Regression Diagnostics,” Journal of Business 
and Economic Statistics, Vol. 6, 1988, pp. 501-504. 
Then the estimate of the coefficient of $D$ is the predicted residual and the 
$t$-ratio for this coefficient is the studentized residual. Thus to generate 
predicted and studentized residuals, the regressions involve all the observations 
in the sample and we create dummy variables in succession for the first, second, 
third, . . . observations. Studentized residuals are usually used in the detection 
of outliers. Suppose that there is an outlier. In the case of the least squares 
residual, it might not be detected because we use the outlier as well in 
computing the regression equation. In the case of predicted and studentized 
residuals, we use all the other observations in computing the regression equation 
and try to use it in predicting this particular (outlying) observation. Thus there 
is a better chance of detecting outliers with this method. 
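
The dummy variable method can be sketched as a loop over observations. The example below is illustrative only: the data are simulated, an outlier is planted at an arbitrarily chosen observation, and the helper `ols_with_se` is a hypothetical name for a plain OLS routine that returns coefficients and conventional standard errors.

```python
import numpy as np

def ols_with_se(Z, y):
    """OLS coefficients and their conventional standard errors."""
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    coef = ZtZ_inv @ Z.T @ y
    resid = y - Z @ coef
    s2 = resid @ resid / (len(y) - Z.shape[1])
    return coef, np.sqrt(s2 * np.diag(ZtZ_inv))

rng = np.random.default_rng(5)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.0]) + rng.normal(size=n)
y[7] += 6.0                                   # planted outlier (index 7, 0-based)

predicted = np.empty(n)
studentized = np.empty(n)
for i in range(n):
    D = np.zeros(n)
    D[i] = 1.0                                # dummy for the i-th observation
    coef, se = ols_with_se(np.column_stack([X, D]), y)
    predicted[i] = coef[-1]                   # coefficient of D
    studentized[i] = coef[-1] / se[-1]        # its t-ratio

# Cross-check against the shortcut u_hat / (1 - h_ii).
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
u_hat = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

print(int(np.argmax(np.abs(studentized))))    # observation flagged as most extreme
```

Each regression uses all $n$ observations plus one dummy, so no observation has to be physically deleted from the data set.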

The least squares and the predicted residuals both suffer from two problems. 
They are correlated and heteroskedastic even if the errors $u_i$ are uncorrelated 
and have the same variance. There have been several methods suggested for 
the construction of residuals that do not have these shortcomings. We discuss 
only two of them: the BLUS residuals suggested by Theil¹¹ and recursive 
residuals. 


BLUS Residuals 

The BLUS (which stands for “best linear unbiased scalar”) residuals are 
constructed from the least squares residuals so that they have the same properties 
as the errors; that is, they have zero mean (they are unbiased), are 
uncorrelated, and have the same variance $\sigma^2$ as the errors. 
The computation of BLUS residuals is too complicated to be described here. 
However, we need not be concerned with this because it has been found that 
in tests for heteroskedasticity and autocorrelation, there is not much to be 
gained by using the BLUS residuals as compared with the least squares 
residuals.¹² Hence we will not discuss the BLUS residuals further. 


Recursive Residuals 

Recursive residuals have been suggested by Brown, Durbin, and Evans¹³ for 
testing the stability of regression relationships. However, these residuals can 
be used for other problems as well, such as tests for autocorrelation and tests 
for heteroskedasticity. Phillips and Harvey¹⁴ use recursive residuals for testing 
serial correlation. Since the recursive residuals are serially uncorrelated and 
have a common variance $\sigma^2$, we can use the von Neumann ratio tests (described 
in Chapter 6). There is no inconclusive region as with the Durbin-Watson test. 

¹¹ H. Theil, “The Analysis of Disturbances in Regression Analysis,” Journal of the American 
Statistical Association, Vol. 60, 1965, pp. 1067-1079. 

¹² Cook and Weisberg, Residuals, p. 35. 

¹³ R. L. Brown, J. Durbin, and J. M. Evans, “Techniques for Testing the Constancy of 
Regression Relationships” (with discussion), Journal of the Royal Statistical Society, Series B, Vol. 
37, 1975, pp. 149-163. This paper gives algorithms for the construction of recursive residuals. 
Farebrother gives algorithms for the construction of BLUS and recursive residuals. See R. W. 
Farebrother, “BLUS Residuals: Algorithm AS 104” and “Recursive Residuals: A Remark on 
Algorithm AS 75: Basic Procedures for Large, Sparse or Weighted Least Squares Problems,” 
Applied Statistics, Vol. 25, 1976, pp. 317-319 and 323-324. 

We will now describe the construction of recursive residuals. First, we order 
the observations sequentially. This is not a problem with time-series data. The 
recursive residual can be computed by forward recursion or backward recursion. 
We describe forward recursion only. Backward recursion is similar. The idea 
behind recursive residuals is this: Let us say that we have $T$ observations and 
the regression equation is 

$$y_t = \beta x_t + u_t \qquad t = 1, 2, \ldots, T$$

Let $\hat{\beta}_i$ be the estimate of $\beta$ from the first $i$ observations. Then we 
use this to predict the next observation $y_{i+1}$. The prediction is 

$$\hat{y}_{i+1} = \hat{\beta}_i x_{i+1}$$

The prediction error is 

$$e_{i+1} = y_{i+1} - \hat{y}_{i+1}$$

Let us denote $V(e_{i+1})$ by $d_{i+1}^2 \sigma^2$. (The variance of prediction error in 
multiple regression has been discussed in Section 4.7.) Then the recursive 
residuals, which we will denote by $w_{i+1}$, are 

$$w_{i+1} = \frac{e_{i+1}}{d_{i+1}}$$

Note that $\operatorname{var}(w_{i+1}) = \sigma^2$. Now we add one more observation, 
estimate $\beta$ using $(i + 1)$ observations, and use this to predict the next 
observation, $y_{i+2}$. Thus 

$$\hat{y}_{i+2} = \hat{\beta}_{i+1} x_{i+2}$$

and if $e_{i+2} = y_{i+2} - \hat{y}_{i+2}$ and $V(e_{i+2}) = d_{i+2}^2 \sigma^2$, then 

$$w_{i+2} = \frac{e_{i+2}}{d_{i+2}}$$

We continue this process until we get to the last observation. If we have $k$ 
explanatory variables, since we have to estimate $k + 1$ parameters (including 
the constant term) and obtain their variances, we need at least $(k + 2)$ 
observations. Thus the recursive residuals start with the observation $(k + 3)$ and 
we have $T - k - 2$ recursive residuals. 

The recursive residuals have been shown to have the following properties.¹⁵ 

1. They are uncorrelated. 

2. They have a common variance $\sigma^2$. 

3. Their sum of squares is equal to RSS, the residual sum of squares from 
the least squares regression. 

¹⁴ G. D. A. Phillips and A. C. Harvey, “A Simple Test for Serial Correlation in Regression 
Analysis,” Journal of the American Statistical Association, Vol. 69, 1974, pp. 935-939. 

¹⁵ Proofs are omitted. These properties are proved in Brown, Durbin, and Evans, “Techniques 
for Testing.” 


The third property is useful in checking the accuracy of the calculations. 

There are algorithms for calculating the regression coefficients and also 
the prediction variance $d^2 \sigma^2$, in a recursive fashion. 16 There is, however, an
alternative method that we can use for the recursive residuals as we have done 
for the predicted residuals described earlier. Note that recursive residuals are 
similar to predicted residuals except that the predictions are sequential. 

Suppose that we use n observations and want to get the prediction error and 
the variance of the prediction error for the (n + 1)th observation. Then all we
have to do is to create a dummy variable D which is defined as

$$D = \begin{cases} 1 & \text{for the } (n+1)\text{th observation} \\ 0 & \text{for all others} \end{cases}$$

We have discussed this procedure in Chapter 8. 

Now we just run a multiple regression with all the (n + 1) observations and
this extra variable D. The estimate of the coefficient of this variable D is the
prediction error $e_{n+1}$ and its standard error is $\hat{\sigma} d_{n+1}$. This result has been
proved in Chapter 8. Thus we get from the regression program $e_{n+1}$ and
$d_{n+1}\hat{\sigma}$, or $t_{n+1}$, the t-ratio, which is $e_{n+1}/(d_{n+1}\hat{\sigma})$. But for the recursive residuals
we need $e_{n+1}/d_{n+1}$. Thus all we have to do is multiply the t-ratio we get for the
coefficient of D by $\hat{\sigma}$. This gives us the recursive residual.
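
The dummy variable calculation can be checked numerically. The following is only a minimal sketch on simulated data (the data, the function names, and the use of NumPy are assumptions made for illustration, not part of the text). It computes forward recursive residuals directly as the prediction error divided by d, and again as the t-ratio on D multiplied by the estimated standard error of the regression. The direct version starts one observation earlier than the dummy variable route, because the latter needs a spare degree of freedom to estimate the error variance.

```python
import numpy as np

def recursive_residuals(X, y):
    """Forward recursive residuals w = e / d, computed directly.

    X is T x K (including the constant column). The first residual is for
    observation K + 1, predicted from the first K observations.
    """
    T, K = X.shape
    w = []
    for n in range(K, T):
        Xn, yn = X[:n], y[:n]
        XtX_inv = np.linalg.inv(Xn.T @ Xn)
        beta_n = XtX_inv @ Xn.T @ yn            # estimate from first n observations
        x_next = X[n]
        e = y[n] - x_next @ beta_n              # prediction error
        d = np.sqrt(1.0 + x_next @ XtX_inv @ x_next)  # var(e) = d^2 * sigma^2
        w.append(e / d)
    return np.array(w)

def recursive_residual_by_dummy(X, y, n):
    """Recursive residual for observation n + 1 via the dummy variable method.

    Uses the first n + 1 observations plus a dummy D that is 1 for the last
    one. Requires n > K so that the error variance can be estimated.
    """
    Z = np.column_stack([X[: n + 1], np.zeros(n + 1)])
    Z[n, -1] = 1.0
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    coef = ZtZ_inv @ Z.T @ y[: n + 1]
    resid = y[: n + 1] - Z @ coef
    sigma2_hat = resid @ resid / ((n + 1) - Z.shape[1])
    t_ratio = coef[-1] / np.sqrt(sigma2_hat * ZtZ_inv[-1, -1])
    return t_ratio * np.sqrt(sigma2_hat)        # t-ratio times sigma-hat

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
T = 30
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=T)

w = recursive_residuals(X, y)
b_full = np.linalg.lstsq(X, y, rcond=None)[0]
rss = float(np.sum((y - X @ b_full) ** 2))
print(float(np.sum(w ** 2)), rss)
```

In this sketch the two routes agree wherever both are defined, and the sum of squares of the directly computed residuals reproduces the OLS residual sum of squares, which is the accuracy check mentioned above.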

The calculation of recursive residuals by the dummy variable method is sim¬ 
ilar to that for predicted residuals except that the regressions for predicted re¬ 
siduals involve all the observations in the sample and for recursive residuals 
the observation set is sequential and the dummy variable is also sequential. 


Illustrative Example 

In Table 12.1 we present the OLS residual, the forward recursive, and the back¬ 
ward recursive residuals for the production function (4.24). The sum of squares 
of the recursive residuals should be equal to the sum of squares of the OLS 
residuals, but there are small discrepancies which are due to rounding errors. 
The recursive residuals are useful for: 

1. Tests for heteroskedasticity described in Chapter 5. 

2. Tests for autocorrelation described in Chapter 6. 

3. Tests for stability described in Chapter 4. 

For stability analysis, Brown, Durbin, and Evans suggest computing the cumulative
sums (CUSUM) and cumulative sums of squares (CUSUMSQ) of the
recursive residuals and comparing them with some percentage points that they 
have tabulated. A discussion of this is beyond the scope of this book. 17 Instead,

16 Phillips and Harvey, “A Simple Test.”

17 Those interested in this detail can refer to the paper by Brown, Durbin, and Evans, “Techniques
for Testing.”
Table 12.1 Different Residuals for the Production Function (4.24) (Multiplied by 100)

Year     OLS Residual   Forward Recursive   Backward Recursive
1929        -1.27                                -1.304
1930        -4.29                                -4.530
1931        -3.99                                -4.710
1932        -1.30                                -3.044
1933        -3.04           -1.975               -6.187
1934        -1.99           -0.809               -6.476
1935        -0.79            0.512               -6.440
1936         1.04            1.115               -4.246
1937        -0.26           -1.040               -4.359
1938         4.55            3.824               -2.900
1939         4.93            4.163               -2.224
1940         4.37            4.182               -2.668
1941         1.84            1.335               -2.971
1942        -1.74           -1.456               -4.775
1943         0.13            1.609               -3.110
1944         6.59            6.065                3.502
1945        11.11            8.487                8.844
1946         1.34           -4.811                5.452
1947        -6.59          -10.322               -1.531
1948        -5.90           -7.172               -1.884
1949        -2.06           -1.601               -0.132
1950         0.12            1.343                2.331
1951        -2.68           -0.884                0.273
1952        -3.24           -0.667               -0.702
1953        -2.38            0.883               -0.716
1954         1.10            4.137                1.293
1955         1.48            3.540                2.917
1956        -1.52            0.186                0.826
1957        -1.00            0.772                0.764
1958         1.43            2.917                0.912
1959         0.00            1.083                0.989
1960        -1.39           -0.367               -0.854
1961         0.38            1.400               -2.372
1962         1.21            1.963               -1.994
1963         2.51            2.865               -1.793
1964         2.31            2.297
1965         1.85            1.561
1966         0.00           -0.328
1967        -2.97           -3.160

Mean         0.00            0.618               -1.252
SD           3.379           3.506                3.338
Range       17.70           18.81                15.32
SS         433.8           431.3                433.6
t            0.00            1.04                -2.22
we will apply some simple t-tests. If a test of the hypothesis that the mean of
the recursive residual is zero gives us a significant t-statistic, this is an indication
of instability of the coefficients. In Table 12.1 we present these t-ratios.
With the forward recursive residuals we do not reject the hypothesis of zero
mean. With the backward recursive residuals, we do reject, at the 5% level, the
hypothesis of zero mean. Thus the conclusions from the recursive residuals are
similar to the conclusions we arrived at earlier in Chapter 4 from the predictive 
tests for stability. 
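
As a rough check, the t-ratios reported in Table 12.1 can be reproduced from the mean and standard deviation rows. The short sketch below assumes that each recursive series has 35 residuals (T - k - 2 with T = 39 annual observations and k = 2 explanatory variables) and that SD is the sample standard deviation; the function name is hypothetical.

```python
import math

def t_ratio_for_zero_mean(mean, sd, n):
    """t-ratio for H0: mean = 0, given the sample mean, sample SD, and n."""
    return mean / (sd / math.sqrt(n))

n = 35  # assumed number of recursive residuals (39 - 2 - 2)
t_forward = t_ratio_for_zero_mean(0.618, 3.506, n)
t_backward = t_ratio_for_zero_mean(-1.252, 3.338, n)
print(round(t_forward, 2), round(t_backward, 2))
```

With 34 degrees of freedom the two-sided 5% critical value is roughly 2.03, so only the backward recursive residuals lead to rejection, in line with the conclusion stated above.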


12.5 DFFITS and Bounded Influence Estimation


In Section 3.8 we discussed briefly the problem of outliers. The usual approach 
to outliers based on least squares residuals is as follows: 

1. Look at the OLS residuals. 

2. Delete the observations with large residuals. 

3. Reestimate the equation. 

Two major problems with this approach are that the OLS residuals (as we 
showed earlier) do not all have the same variance and furthermore the OLS 
residuals do not give us any idea of how important this particular observation 
is for the overall results. The idea behind the studentized residual is to allow 
for these differences in the variances and to look at the prediction error resulting
from the deletion of this observation. Using a plus or minus $2\sigma$ rule of
thumb, the studentized residuals shown in Table 12.2 suggest that the observations
for the years 1945 and 1947 are outliers. With the OLS residuals in
Table 12.1 we might have included 1944 and 1948 as well. 

One other measure that is used to detect outliers is to see the change in the
fitted value $\hat{y}$ of y that results from dropping a particular observation. Let $\hat{y}_{(i)}$
be the fitted value of y if the ith observation is dropped. The quantity $(\hat{y}_i - \hat{y}_{(i)})$
divided by the scaling factor $\sqrt{h_{ii}}\,S_{(i)}$, where $S_{(i)}^2$ is the estimator of $\sigma^2$ from a regression
with the ith observation omitted, is called $\text{DFFITS}_i$.

It has been shown that 18

$$\text{DFFITS}_i = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} \tilde{u}_i$$

where $\tilde{u}_i$ is the ith studentized residual and $h_{ii}$ is the quantity that figures repeatedly
in equations (12.1)-(12.3). (It is also known as the ith diagonal term
of the “hat matrix.”)
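
The relation between DFFITS and the studentized residual can be verified on a small example. The sketch below is illustrative only: it uses simulated data and hypothetical function names, and it takes the studentized residual to be the one scaled by the deletion estimate of the error variance. It computes DFFITS from the closed-form expression and compares it with the definition obtained by actually dropping each observation.

```python
import numpy as np

def studentized_and_dffits(X, y):
    """Studentized residuals and DFFITS from a single OLS fit.

    X is n x p (including the constant column).
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)     # diagonal of the hat matrix
    e = y - X @ (XtX_inv @ X.T @ y)                 # OLS residuals
    rss = e @ e
    s2_del = (rss - e ** 2 / (1.0 - h)) / (n - p - 1)  # sigma^2 estimate with obs i omitted
    student = e / np.sqrt(s2_del * (1.0 - h))
    return student, student * np.sqrt(h / (1.0 - h)), h

def dffits_by_deletion(X, y, i):
    """DFFITS for observation i computed from its definition (refit without i)."""
    mask = np.arange(len(y)) != i
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]
    b_del = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    resid = y[mask] - X[mask] @ b_del
    s_del = np.sqrt(resid @ resid / (mask.sum() - X.shape[1]))
    h_ii = X[i] @ np.linalg.inv(X.T @ X) @ X[i]
    return (X[i] @ b_full - X[i] @ b_del) / (s_del * np.sqrt(h_ii))

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(2)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

student, dffits, h = studentized_and_dffits(X, y)
brute = np.array([dffits_by_deletion(X, y, i) for i in range(n)])
print(float(np.max(np.abs(dffits - brute))))
```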

There are many computer programs available to compute studentized residuals
and DFFITS. For instance, the SAS regression program gives these statistics.
The results in Table 12.2 have been obtained from the SAS regression
program.

18 D. A. Belsley, E. Kuh, and R. E. Welsch, Regression Diagnostics: Identifying Influential Data
and Sources of Collinearity (New York: Wiley, 1980).

Table 12.2 Regression Diagnostics for Detecting Outliers for the Production
Function (4.24)

        Studentized                      Studentized
Year    Residual      DFFITS     Year    Residual      DFFITS
1929     -0.38        -0.08      1949     -0.61        -0.16
1930     -1.28        -0.35      1950      0.04         0.01
1931     -1.21        -0.39      1951     -0.79        -0.19
1932     -0.42        -0.21      1952     -0.96        -0.22
1933     -0.97        -0.46      1953     -0.70        -0.15
1934     -0.61        -0.22      1954      0.32         0.06
1935     -0.24        -0.07      1955      0.43         0.08
1936      0.31         0.08      1956     -0.45        -0.09
1937     -0.08        -0.02      1957     -0.29        -0.06
1938      1.36         0.37      1958      0.42         0.10
1939      1.46         0.37      1959      0.01         0.00
1940      1.29         0.29      1960     -0.41        -0.11
1941      0.54         0.12      1961      0.11         0.03
1942     -0.51        -0.12      1962      0.37         0.11
1943      0.04         0.01      1963      0.75         0.23
1944      1.94         0.46      1964      0.70         0.22
1945      3.28         0.88      1965      0.56         0.18
1946      0.42         0.17      1966      0.02         0.01
1947     -2.04        -0.85      1967     -0.91        -0.33
1948     -1.80        -0.66

Belsley, Kuh, and Welsch suggest that DFFITS is a better criterion to detect 
outliers and influential observations. DFFITS is a standardized measure of the 
difference in the fitted value of y due to deleting this particular observation. 
Further, they suggest that observations with large studentized residuals or 
DFFITS should not be deleted. Their influence should be minimized. This 
method of estimation is called bounded influence estimation. The details of this 
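method are complicated but a simple one-step bounded influence estimator suggested
by Welsch 19 is as follows: Minimize $\sum_i w_i (y_i - \beta x_i)^2$, where

$$w_i = \begin{cases} 1 & \text{if } |\text{DFFITS}_i| \le 0.34 \\[4pt] \dfrac{0.34}{|\text{DFFITS}_i|} & \text{if } |\text{DFFITS}_i| > 0.34 \end{cases}$$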
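
A minimal sketch of this one-step weighting scheme is given below. It is not Welsch's own code: the data are simulated with one planted outlier, the function names are hypothetical, and the weighted minimization is carried out by ordinary least squares on rows scaled by the square root of the weights. The cutoff 0.34 is simply the value quoted in the text.

```python
import numpy as np

def dffits(X, y):
    """DFFITS for each observation from a single OLS fit (X includes the constant)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    e = y - X @ (XtX_inv @ X.T @ y)
    s2_del = (e @ e - e ** 2 / (1.0 - h)) / (n - p - 1)
    return e / np.sqrt(s2_del * (1.0 - h)) * np.sqrt(h / (1.0 - h))

def bounded_influence_weights(d, cutoff=0.34):
    """Weight 1 if |DFFITS| <= cutoff, otherwise cutoff / |DFFITS|."""
    a = np.abs(d)
    return np.where(a <= cutoff, 1.0, cutoff / np.maximum(a, cutoff))

def one_step_bounded_influence(X, y, cutoff=0.34):
    """Minimize sum_i w_i (y_i - x_i'b)^2 with weights based on OLS DFFITS."""
    w = bounded_influence_weights(dffits(X, y), cutoff)
    sw = np.sqrt(w)
    b = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return b, w

# Simulated data with one gross outlier (hypothetical, for illustration only)
rng = np.random.default_rng(3)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
y[0] += 20.0

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_bi, w = one_step_bounded_influence(X, y)
print(b_ols, b_bi, float(w.min()))
```

The planted outlier is downweighted rather than deleted, which is the point of the method: its influence on the estimates is bounded, not removed.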


19 Roy E. Welsch, “Regression Sensitivity Analysis and Bounded Influence Estimation,” in J.
Kmenta and J. B. Ramsay (eds.), Evaluation of Econometric Models (New York: Academic
Press, 1980), pp. 153-167. 
Illustrative Example

As an illustration, consider again the production function (4.24). The values
of DFFITS are shown in Table 12.2. There are nine observations (all before
1948) that have $|\text{DFFITS}_i| > 0.34$. These observations receive a weight < 1 in
the bounded influence estimation.

Using this weighting scheme, we obtained the following results: 


                Bounded                  OLS with
Estimate of:    Influence     OLS        Outlier Deletion
$\alpha$         -3.987      -3.938        -3.980
$\beta_1$         1.468       1.451         1.466
$\beta_2$         0.375       0.384         0.376

For comparison we also present the estimates from the OLS regression (4.24) 
and also estimates using OLS after deleting the observations for 1944, 1945, 
1947, and 1948, the years for which the OLS residuals in Table 12.1 are large. 

In this example there was not much difference in the estimated coefficients. 
In fact, the bounded influence method and OLS with outlier deletion gave al¬ 
most identical results. The data set we have used is perhaps not appropriate 
for the illustration of the bounded influence method. The problem of parameter 
instability and autocorrelated errors seems to be more important with this data 
set than that of detection of outliers. 

In any case the preceding discussion gives an idea of what “bounded influ¬ 
ence estimation” is about. The basic point is that the OLS residuals are not 
appropriate for detection of outliers. Further, outliers should not all be dis¬ 
carded. Their influence on the least squares estimates should be reduced 
(bounded) based on their magnitude. 

As mentioned earlier, the data set we have used has not turned out to be 
appropriate for illustrating the method. Other data sets in the book can be used 
to check out the usefulness of the method. 

Krasker 20 gives an interesting example of the use of bounded influence esti¬ 
mation. The problem is a forecasting problem faced by Korvett’s Department 
Stores in 1975. The company has to choose between two locations, A and B to 
start a new store. Data are available for 25 existing stores on the following 
variables. 

y = sales per capita
$x_1$ = median home value ($\times 10^{-6}$)
$x_2$ = average family size ($\times 10^{-2}$)
$x_3$ = percent of population which is black or hispanic ($\times 10^{-4}$)


20 W. S. Krasker, “The Role of Bounded Influence Estimation in Model Selection,” Journal of
Econometrics, Vol. 16, 1981, pp. 131-138. 
The regression results were (dependent variable: sales) (figures in parentheses
are standard errors) as follows:

        Constant    $x_1$     $x_2$     $x_3$     $\hat{\sigma}$
OLS      -0.13      2.70      2.5       0.22      0.014
         (0.05)    (0.51)    (1.1)     (3.1)
WLS      -0.05      1.00      1.7      -4.1       0.010
         (0.04)    (0.52)    (1.0)     (2.9)


Note the change in the coefficient for $x_3$. The WLS estimator is the weighted
least squares. 21 Krasker argues that there are two outliers (observations 2 and 
11 in this sample). The other 23 observations are “well described by an OLS 
regression whose estimates are essentially those of the WLS.” Thus again in 
this example the bounded influence estimator does not appear to be different 
from the OLS with the two outliers omitted. (Results from OLS with 23 obser¬ 
vations are not presented here.) 

Krasker suggests that site A is similar to observation 2 and if the model 
cannot be used to predict observation 2, it should not be used to make predic¬
tions for site A (with 50.8% of the population from minorities). The model (OLS 
with 23 observations or WLS with all observations) can be used to make pre¬ 
dictions for site B. 


12.6 Model Selection 


In the usual textbook econometrics, the statistical model underlying the data 
is assumed to be known at the outset and the problem is merely one of obtaining 
good estimates of the parameters in the model. In reality, however, the choice 
of a model is almost always made after some preliminary data analysis. For 
instance, in the case of a regression model, we start with a specification that 
seems most reasonable a priori. But after examining the coefficients, their 
standard errors and the residuals, we change the specification of the model. 
Purists would consider this “data mining” as an illegitimate activity, but it is 
equally unreasonable to assume that we know the model exactly at the very 
outset. 

The area of model selection comprises:

1. Choice between some models specified before any analysis. 

2. Simplification of complicated models based on the data (data-based sim¬ 
plification). 

3. Post-data model construction. 


21 The weighting scheme is slightly different from the weighting scheme discussed in Welsch,
“Regression Sensitivity.” 
The area of model selection is quite vast in its scope and includes diagnostic
checking and specification testing (the other two areas we are discussing in this
chapter). There are many references on the topic, some of which are a book by
Leamer, 22 a paper by Mizon, 23 the dissertation by Sawyer, 24 and a special volume
of the Journal of Econometrics. 25 Since it is impossible to cover this vast
area, we discuss Leamer’s classification of the different types of model
searches usually attempted, and also Hendry’s ideas behind data-based simplification
of models. In Sections 12.7 to 12.9 we go through two particular aspects
of model selection:

1. Selection of regressors. 

2. Use of cross-validation techniques. 

Leamer talks of six types of specification searches that are usually undertaken
in the process of model selection. The differences between the different
searches are very minor. However, they are useful for organization of our 
ideas. The different searches are: 


Type of Search                        Purpose

(1) Hypothesis-testing search         Choosing a “true model”
(2) Interpretive search               Interpreting the sample evidence on
                                      many intercorrelated variables
(3) Simplification search             Constructing a “fruitful” model
(4) Proxy variable search             Choosing between different measures of
                                      the same set of hypothetical variables
(5) Data selection search             Selecting the appropriate data set for
                                      estimation and prediction
(6) Post-data model construction      Improving an existing model


Many of these searches have been discussed in previous chapters. But we 
will give further examples: 


Hypothesis-Testing Search 

Suppose that we estimate a Cobb-Douglas production function as in Section 
4.11, and test the hypothesis of constant returns to scale ($\alpha + \beta = 1$ in that
example). If the hypothesis is rejected, as in that example, we do not change 
the specification of the model. If it is not rejected, we change the specification 
and estimate a production function with constant returns to scale. 


22 E. E. Leamer, Specification Searches (New York: Wiley, 1978).

23 G. E. Mizon, “Model Selection Procedures,” in M. J. Artis and A. R. Nobay (eds.), Studies
in Current Economic Analysis (Oxford: Basil Blackwell, 1977), Chap. 4.

24 K. R. Sawyer, “The Theory of Econometric Model Selection,” unpublished doctoral disser¬ 
tation, Australian National University, 1980. 

25 G. S. Maddala (ed.), “Model Selection,” Journal of Econometrics, Vol. 16, 1981. 
Interpretive Search

Sometimes the coefficients of the model do not make economic sense but the
imposition of some constraints does. For instance, based on data for 150 households,
Leamer 26 estimated the demand for oranges as

$$\log D_i = \underset{(1.0)}{3.1} + \underset{(0.20)}{0.83} \log E_i + \underset{(0.15)}{0.01} \log P_i - \underset{(0.60)}{0.56} \log \pi_i \qquad R^2 = 0.20$$

where $D_i$ = purchases of oranges by household i
$E_i$ = total expenditures by household i
$P_i$ = price of oranges
$\pi_i$ = price of grapefruit (a substitute commodity)

(Figures in parentheses are standard errors.) The coefficients of the price variables
are insignificant and have the “wrong” sign. Also, the sum of the coefficients
(0.83 + 0.01 - 0.56 = 0.28) is rather far from zero. If there is no money
illusion, then multiplying $E_i$, $P_i$, and $\pi_i$ by the same factor should not produce
any change in $D_i$. This implies that the sum of the coefficients of these variables
should be zero. (This is known as the “homogeneity postulate.”) Imposing this
condition, Leamer gets the result:

$$\log D_i = \underset{(0.9)}{4.2} + \underset{(0.19)}{0.52} \log E_i - \underset{(0.14)}{0.61} \log P_i + \underset{(0.31)}{0.09} \log \pi_i \qquad R^2 = 0.19$$

The $R^2$ has fallen only slightly and the coefficients all have the right signs.
Expenditure $E_i$ and price $P_i$ are significant. The constraint has improved the specification
and the interpretation.
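
One way to impose the homogeneity restriction in practice is to substitute it into the equation: if the three coefficients must sum to zero, the grapefruit-price coefficient equals minus the sum of the other two, so the regressors become log(E/π) and log(P/π). The sketch below illustrates this on simulated data; the data-generating values, variable names, and sample are assumptions for illustration and are not Leamer's data.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and the residual sum of squares."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return b, float(r @ r)

# Simulated household data (hypothetical values, for illustration only)
rng = np.random.default_rng(0)
n = 150
log_E = rng.normal(5.0, 0.5, n)     # log total expenditure
log_P = rng.normal(0.0, 0.3, n)     # log price of oranges
log_pi = rng.normal(0.0, 0.3, n)    # log price of grapefruit
log_D = 4.0 + 0.5 * log_E - 0.6 * log_P + 0.1 * log_pi + rng.normal(0.0, 1.0, n)

# Unrestricted regression
X_u = np.column_stack([np.ones(n), log_E, log_P, log_pi])
b_u, rss_u = ols(X_u, log_D)

# Restricted regression: substitute b_pi = -(b_E + b_P)
X_r = np.column_stack([np.ones(n), log_E - log_pi, log_P - log_pi])
b_r, rss_r = ols(X_r, log_D)
b_E, b_P = b_r[1], b_r[2]
b_pi = -(b_E + b_P)

print(b_u[1:].sum(), b_E + b_P + b_pi, rss_u, rss_r)
```

The restricted coefficients sum to zero by construction, and the restricted residual sum of squares can never be smaller than the unrestricted one, which is why the R² falls (here only slightly, as in the text).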

Simplification Search 

In the equation above the coefficient of $\log \pi_i$ is not significant. Dropping this
variable and imposing the homogeneity constraint, that is, assuming the other
two coefficients to be equal in value and opposite in sign, we get

$$\log D_i = \underset{(0.8)}{3.7} + \underset{(0.18)}{0.58} \log (E_i/P_i) \qquad R^2 = 0.18$$

The R 2 is only slightly smaller and we have a simplified equation. This is called 
a simplification search. The purpose of this search is to find a simple but useful 
model. 

Proxy Variable Search 

In econometric work an investigator is faced with several definitions of the 
same variable. There are several definitions of money supply, several defini¬ 
tions of income, and so on. Further, some variables like education, ability, and 
risk are not directly measurable and we have to use some proxies for them. We 

26 Leamer, Specification Searches, p. 8.
are thus left with the problem of choosing among the different proxies. In the
example of demand for oranges that Leamer considered, one has to choose
between money income $Y_i$ and expenditures $E_i$ as measures of the household’s
true income. The estimated equations he gets are

$$\log D_i = \underset{(1.1)}{6.2} + \underset{(0.21)}{0.85} \log Y_i - \underset{(0.13)}{0.67} \log P_i \qquad R^2 = 0.15$$

and

$$\log D_i = \underset{(1.0)}{5.2} + \underset{(0.18)}{1.1} \log E_i - \underset{(0.16)}{0.45} \log P_i \qquad R^2 = 0.18$$

The $R^2$ has increased with the use of $E_i$, suggesting that $E_i$ is a better proxy
than $Y_i$ for “true income.”


Data Selection Search 

Often, in econometric work we have different data sets from which we can 
estimate the same relationship. A question often arises as to whether we can 
pool the different data sets and get more efficient estimates of the parameters. 
In Section 4.6 we gave some examples where the data sets referred to prewar 
and postwar years. There we found significant differences in the coefficients 
between the two periods which suggested that the data should not be pooled. 
This is an example of a data-selection search. 


Post-data Model Construction 

This is what Leamer calls “Sherlock Holmes” inference. In response to a question
by Dr. Watson about the likely perpetrators of the crime, Sherlock Holmes
replied: “No data yet. ... It is a capital mistake to theorize before you have 
all the evidence. It biases the judgments.” According to the traditional statis¬ 
tical theory, on the other hand, it is a “capital mistake to view the facts before 
you have all the theories. It biases the judgments.” Any theory that is postu¬ 
lated after looking at a particular data set cannot be tested using the same data 
set, because doing so would amount to double counting. On the other hand, 
Sherlock Holmes would argue that the set of viable alternative hypotheses is 
immense and the set of hypotheses formulated before the data set is observed 
can be incomplete. There is always the risk that the data favor some unspeci¬ 
fied hypothesis. Hence the data evidence is used to construct a set of “empir¬ 
ically relevant” hypotheses, thereby reducing the cost of formulating a com¬ 
prehensive set of hypotheses and the risk of not identifying the “best” 
hypothesis. 

In almost all econometric work investigators do something similar to what 
Sherlock Holmes does. They formulate some hypotheses, then observe that 
the coefficients of some variables have wrong signs or implausible magnitudes 
or that the residuals have a peculiar pattern. Then they introduce more explan¬ 
atory variables or impose some constraints on the parameters. A question 
arises as to whether this type of Sherlock Holmes inference can be brought
within the scope of statistical inference. 

Leamer’s answer to this is that we have to view this new “data-instigated
hypothesis” as something that was in the back of our mind all the time. We did
not consider it explicitly because we thought it to be unimportant but now we
change our mind after observing the data. Suppose that the model we want to
consider is $y = \beta x + \gamma z + u$. Initially, we consider z to be uncorrelated with
x or that $\gamma$ is negligibly small. So we start with the model $y = \beta x + u$. If the
resulting estimate of $\beta$ has the wrong sign or the pattern of estimated residuals
is peculiar, we change our mind and observe z. If the problem is viewed in this
form of a decision problem, Leamer argues that we can attach the appropriate
standard errors to the coefficients when the model $y = \beta x + \gamma z + u$ is estimated.

In all cases of specification searches, there is the question of what the appro¬ 
priate standard errors should be for the final model estimated. There is no easy 
answer to this question except that the standard errors are higher than those 
obtained from the estimation of the final model. The inference problem is more 
straightforward in the case of a simplification search. That is why David Hen¬ 
dry suggests starting with an extremely general model and then simplifying it 
progressively. We will, therefore, discuss this approach in greater detail. 

Hendry’s Approach to Model Selection 

The approach to model building suggested by David Hendry, 27 which is mainly 
applicable to dynamic time-series models, can be summarized as: intended overparametrization
with data-based simplification. By contrast, most empirical
econometric work can be characterized as excessive presimplification with
inadequate diagnostic testing. The latter method consists of the following 
steps: 

1. Commence from theories which are drastic abstractions of reality. 

2. Formulate highly parsimonious relationships to represent the theories. 

3. Estimate the equations from the available data using techniques which 
are “optimal” only on the assumptions that the highly restricted model is 
correctly specified. 

4. Test a few of the assumptions explicitly or implicitly (such as the conven¬ 
tional Durbin-Watson test for autocorrelation). 

5. Revise the specification in the light of evidence acquired. 

6. Reestimate accordingly. 

According to Hendry, this approach to model building, which is a “specific 
to general” approach or a “bottom-up” approach, has three main drawbacks:

27 This approach is originally due to J. D. Sargan but has been popularized and expounded by 
David Hendry. Some representative papers are: D. F. Hendry, “Predictive Failure and Econo¬ 
metric Modelling in Macroeconomics: The Transactions Demand for Money,” in Paul Ormerod 
(ed.), Economic Modelling (London: Heinemann, 1979), Chap. 9, pp. 217-242; and Mizon,
“Model Selection Procedures.” 
1. Every test is conditional on arbitrary assumptions which are to be tested
later, and if these are rejected, all earlier inferences are invalidated.

2. The significance levels of the unstructured sequence of tests actually conducted
are unknown. For instance, suppose that we estimate an equation
by OLS, find a significant DW statistic, and then reestimate the equation
adjusting for the serial correlation. What are the significance levels for
the estimated coefficients from this transformed equation? This is not
very clear.

3. It is not always possible to end up with the best model by using this
iterative method. We might get sidetracked by using the wrong (or inadequate)
diagnostic tests. For instance, we might start with the equation
$y = \beta x + u$ and, observing a significant DW test statistic, reestimate
the equation in a $\rho$-differenced form. If we now find that the coefficients
of our equation are of the correct sign, we might rest satisfied. However,
this procedure might lead us to the wrong model because, as discussed in
Section 6.9, the DW statistic could be significant not because of serial
correlation in errors but because of misspecified dynamics. The model
with serially correlated errors

$$y_t = \beta x_t + u_t, \qquad u_t = \rho u_{t-1} + e_t$$

implies the model

$$y_t = \rho y_{t-1} + \beta x_t - \beta\rho x_{t-1} + e_t$$

which is the same as

$$y_t = \beta_1 y_{t-1} + \beta_2 x_t + \beta_3 x_{t-1} + e_t$$

with the restriction $\beta_1 \beta_2 = -\beta_3$. Now the true model could be the last
one but with the restriction $\beta_1 \beta_2 = -\beta_3$ not satisfied. But there is no way
of our arriving at this model by the modeling approach adopted. When
we estimate the model $y_t = \beta x_t + u_t$ we would observe serial correlation
in the residuals because $y_{t-1}$ and $x_{t-1}$ are “missing.” But this does not
mean that we have a model with serially correlated errors.
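
The restriction linking the two forms can be seen in a small simulation. The sketch below is an illustration under assumed parameter values (β = 2, ρ = 0.6, a large sample, hypothetical variable names): it generates data from the static model with first-order autoregressive errors, fits the unrestricted dynamic regression by OLS, and checks that the estimates approximately satisfy the product restriction on the three coefficients.

```python
import numpy as np

# Assumed parameter values (for illustration only)
rng = np.random.default_rng(1)
T, beta, rho = 20000, 2.0, 0.6

x = rng.normal(size=T)
e = rng.normal(size=T)

# AR(1) errors: u_t = rho * u_{t-1} + e_t
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + e[t]

y = beta * x + u

# Unrestricted dynamic regression: y_t on y_{t-1}, x_t, x_{t-1}
Z = np.column_stack([y[:-1], x[1:], x[:-1]])
b1, b2, b3 = np.linalg.lstsq(Z, y[1:], rcond=None)[0]

# Under the AR(1)-error model, b1 * b2 + b3 should be close to zero
common_factor_gap = b1 * b2 + b3
print(b1, b2, b3, common_factor_gap)
```

If the data had instead come from a dynamic model that violates the restriction, the same unrestricted regression would reveal it, whereas a search that only ever corrects for serial correlation could not.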

By contrast, the approach suggested by Hendry, which is a “top-down” or 
“general to specific” approach, starts with a very general dynamic model, 
which is “overparametrized,” that is, has more lags than you would consider 
necessary. The model is then progressively simplified with a sequence of “simplification
tests.” The significance levels for the sequence of tests are known,
unlike the case of the sequence of tests we perform in the “specific-to-general”
approach. The significance level for the jth test is 28

$$1 - \prod_{i=1}^{j} (1 - \gamma_i)$$

28 This result was derived by T. W. Anderson in the paper cited in Section 10.8. This result has 
been proved for dynamic econometric models by J. D. Sargan (in an unpublished London 
School of Economics discussion paper). 
where $\gamma_i$ is the significance level for the ith test in the sequence. The sequential
testing procedures are used to select a “data coherent specialization.” 
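
To see how quickly the overall significance level grows along a sequence of simplification tests, the formula can be evaluated directly. The example below is a small sketch with a hypothetical function name; it assumes five tests, each conducted at the 5% level.

```python
import math

def overall_significance(levels):
    """Overall significance level 1 - prod(1 - gamma_i) for a sequence of tests."""
    return 1.0 - math.prod(1.0 - g for g in levels)

overall = overall_significance([0.05] * 5)
print(round(overall, 4))
```

Five nominal 5% tests thus imply an overall level of about 22.6%, which is why the individual levels in a long simplification sequence are usually chosen to be small.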

Hendry argues that only after these steps should one test economic theories. 
“Until the model characterizes the data generation process, it seems rather 
pointless trying to test hypotheses of interest in economic theory.” 29 


12.7 Selection of Regressors 


In Section 12.6 we outlined some general approaches to the model selection 
problem. We now concentrate on a specific issue, the problem of selection of 
regressors. 

In Chapter 4 we considered a multiple regression of an explained variable y 
on a set of k explanatory variables $x_1, x_2, \ldots, x_k$. It was assumed that the set
of variables to be included in the equation is given. In practice this is rarely the 
case. There is typically a very large number of potential explanatory variables 
or regressors and one is faced with a problem of choosing a subset of these. 
This is the problem often referred to as the problem of selection of regressors. 

Suppose that there is a potential set of k regressors from which we have to 
select a smaller number. In the 1960s a number of stepwise regression methods 
were suggested. 30 Some of these started with a regression, including all the k 
variables and successively proceeded to eliminate variables with t-ratios less
than a prespecified value (say, 1). This is called a backward selection proce¬ 
dure. Others started with a single variable which had the highest correlation 
with y and then picked at each stage the variable with the highest partial cor¬ 
relation coefficient (this is called forward selection procedure). Some proce¬ 
dures combined the elements of the forward and backward selection proce¬ 
dures at each stage (i.e., deleted some variables and added some others). 
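
A backward selection procedure of the kind just described can be sketched in a few lines. This is only an illustration on simulated data with hypothetical function names: it starts from all candidate regressors, drops the one with the smallest absolute t-ratio if that ratio is below the prespecified value (1 here), and repeats. The constant term is always retained, which is an assumption of this sketch.

```python
import numpy as np

def ols_t_ratios(X, y):
    """OLS coefficients and their t-ratios (X includes the constant column)."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - p)
    return b, b / np.sqrt(s2 * np.diag(XtX_inv))

def backward_selection(X, y, threshold=1.0):
    """Drop regressors one at a time while the smallest |t| is below threshold.

    X holds the candidate regressors without the constant; returns the
    column indices that are kept.
    """
    keep = list(range(X.shape[1]))
    while keep:
        Z = np.column_stack([np.ones(len(y)), X[:, keep]])
        _, t = ols_t_ratios(Z, y)
        t_abs = np.abs(t[1:])               # ignore the constant term
        worst = int(np.argmin(t_abs))
        if t_abs[worst] >= threshold:
            break
        del keep[worst]
    return keep

# Simulated data: only the first two candidates matter (hypothetical example)
rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 4))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

kept = backward_selection(X, y)
print(kept)
```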

This sort of mechanical picking of variables by the computer is no longer 
popular among econometricians. Fortunately, although economists do not have 
an exact idea of all the variables that have to be included in an equation, they 
do have an idea of what variables are likely to be very important and what 
variables are doubtful. In this case we can specify a small number of alternative 
models and then we need to choose one of these alternatives. Since theory 
cannot give us any guidance at this stage we have to make a choice on statistical 
grounds. 

There are many criteria that have been suggested in the literature. Some of 
these are listed in Table 12.3. We discuss them briefly. The criteria that we have 
chosen in Table 12.3 are applicable to regression models only and we have 
chosen those criteria that depend on some summary measures commonly used 
like residual sum of squares. 


29 Hendry, “Predictive Failure,” p. 226.

30 These methods are not popular at present. However, those interested in a discussion of stepwise
procedures can refer to N. Draper and H. Smith, Applied Regression Analysis, 2nd ed.
(New York: Wiley, 1981), Chap. 6.
Table 12.3 Some Criteria for Choice Among Regression Models

Criterion                       Minimize (a)

Theil’s $\bar{R}^2$ (1961)            $\text{RSS}_j/(n - k_j)$
Hocking’s $S_p$ (1976)            $\text{RSS}_j/[(n - k_j)(n - k_j - 1)]$
Mallows’ $C_p$ (1973)             $\text{RSS}_j + 2 k_j \hat{\sigma}_m^2$
Amemiya’s PC (1980)             $\text{RSS}_j (n + k_j)/(n - k_j)$
Akaike’s AIC (1973, 1977)       $\text{RSS}_j \exp[2(k_j + 1)/n]$

(a) $k_j$, number of explanatory variables; $\text{RSS}_j$, residual sum of squares for the jth model; $\hat{\sigma}_m^2$,
(residual sum of squares)/(n - k) in the model that includes all the k explanatory variables.


Let $\text{RSS}_j$ denote the residual sum of squares from the jth model with $k_j$ explanatory
variables. We define

$$\hat{\sigma}_j^2 = \frac{\text{RSS}_j}{n - k_j}$$

an estimate of $\sigma^2$ from the jth model. We denote by $\hat{\sigma}_m^2$ the estimate of $\sigma^2$ from
a model that includes all the k explanatory variables.
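
For a given candidate model, all five quantities to be minimized depend only on the residual sum of squares, the sample size, the number of explanatory variables, and the variance estimate from the full model. The following sketch evaluates them as we read the formulas in Table 12.3; the function name, the dictionary keys, and the numerical inputs are hypothetical.

```python
import math

def selection_criteria(rss_j, n, k_j, sigma2_m):
    """Quantities to be minimized under each criterion, as read from Table 12.3.

    rss_j    : residual sum of squares of the jth model
    n        : number of observations
    k_j      : number of explanatory variables in the jth model
    sigma2_m : estimate of the error variance from the model with all k variables
    """
    return {
        "theil": rss_j / (n - k_j),
        "hocking_sp": rss_j / ((n - k_j) * (n - k_j - 1)),
        "mallows_cp": rss_j + 2 * k_j * sigma2_m,
        "amemiya_pc": rss_j * (n + k_j) / (n - k_j),
        "akaike_aic": rss_j * math.exp(2 * (k_j + 1) / n),
    }

# Hypothetical inputs for one candidate model
crit = selection_criteria(rss_j=10.0, n=50, k_j=3, sigma2_m=0.25)
for name, value in crit.items():
    print(name, round(value, 4))
```

In use, the function would be called once per candidate model and the model with the smallest value of the chosen criterion would be selected.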

We now discuss briefly the criteria listed in Table 12.3. 

Theil’s $\bar{R}^2$ Criterion

Theil’s criterion 31 is based on the assumption that one of the models considered
is the correct model. In this case if $\hat{\sigma}_j^2 = \text{RSS}_j/(n - k_j)$ is the estimate of $\sigma^2$
from the jth model, then $E(\hat{\sigma}_j^2) = \sigma^2$ for the correct model but is $> \sigma^2$ for the
misspecified model. 32 Thus, choosing the model with the minimum $\hat{\sigma}^2$ will on
the average lead us to pick the correct model. Since minimizing $\hat{\sigma}^2$ is the same
as maximizing $\bar{R}^2$ [see equation (4.22)], we refer to the rule alternatively as the
maximum $\bar{R}^2$ rule.

A major problem with this rule is that a model that has all the explanatory
variables of the correct model but also a number of irrelevant variables will
also give $E(\hat{\sigma}^2) = \sigma^2$. Thus the rule will not help us pick the correct model in
this case. This indeed is confirmed by the results in Schmidt and Ebbeler 33
concerning the power function of the maximum $\bar{R}^2$ criterion. The probability


31 H. Theil, Economic Forecasts and Policy, 2nd ed. (Amsterdam: North-Holland, 1961).

32 The proof of this result can be found in several books. See, for instance, G. S. Maddala,
Econometrics (New York: McGraw-Hill, 1977), pp. 461-462. It is shown there that a model that
has all the explanatory variables of the correct model but also a number of irrelevant variables
will also give $E(\hat{\sigma}^2) = \sigma^2$.

33 P. Schmidt, “Calculating the Power of the Minimum Standard Error Choice Criterion,” International
Economic Review, February 1973, pp. 253-255. The numerical errors in Schmidt’s
paper are corrected in D. H. Ebbeler, “On the Probability of Correct Model Selection Using
the Maximum $\bar{R}^2$ Choice Criterion,” International Economic Review, June 1975, pp. 516-520.
There are three types of misspecifications considered in these papers: omitted variables, irrel¬ 
evant variables, and wrong variables. For the first two cases, Ebbeler shows that some simple 
analytical results are available. 
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of picking the correct model is considerably below 1 when the alternative 
model includes a number of irrelevant variables. 


Criteria Based on Minimizing the 
Mean Square Error of Prediction 

Theil’s criterion is based on minimizing the standard error of the regression. 
The following three criteria are based on minimizing the mean-square error of 
prediction: 

1. Mallows’ $C_p$ criterion. 34

2. Hocking’s $S_p$ criterion. 35

3. Amemiya’s PC criterion. 36

Suppose that the correct equation involves $k$ variables and the equation we
consider involves $k_1$ ($< k$) variables. The problem is how to choose the number
$k_1$, as well as the particular set of $k_1$ variables. In the prediction criteria we are
considering this is done by minimizing the mean-squared error of prediction
$E(y_f - \hat{y}_f)^2$, where $y_f$ is the future value of $y$ and $\hat{y}_f$ is the predicted value. If we
denote the future values of the explanatory variables by $x_{if}$, then $E[(y_f - \hat{y}_f)^2 \mid x_{if}]$
is called the conditional MSE (mean-square error) of prediction and $E(y_f - \hat{y}_f)^2$
is called the unconditional MSE of prediction.

To get the unconditional MSE of prediction we have to make some assumptions
about $x_{if}$. Amemiya assumes that the regressors for the prediction period
are stochastic and that the values of $x_{if}$ have the same behavior as the variables
$x_{it}$ during the sample period. Under this assumption he shows that 37

$$\text{estimate of } E(y_f - \hat{y}_f)^2 = \frac{2k_1}{n}\sigma^2 + \frac{\mathrm{RSS}_1}{n}$$

where $\mathrm{RSS}_1$ is the residual sum of squares from the model with $k_1$ regressors.

Now $\sigma^2$ has to be estimated. If we use $\hat{\sigma}_m^2 = \mathrm{RSS}/(n - k)$, where RSS is the
residual sum of squares from the complete model with $k$ explanatory variables,
we get Mallows’ $C_p$ criterion. On the other hand, if we use $\hat{\sigma}_1^2 = \mathrm{RSS}_1/(n - k_1)$,
then we get Amemiya’s PC criterion. Note, however, that $\mathrm{RSS}_1/(n - k_1)$ is
an upward-biased estimate of $\sigma^2$ because of the fact that it is an estimate from
a misspecified regression. 38 On the other hand, $\hat{\sigma}_1^2$ is an unbiased estimate of $\sigma^2$
if it is assumed that the model with $k_1$ variables is the correct model and the
model with $k$ variables includes a number of irrelevant variables. This is the
“optimistic” assumption that Amemiya makes. However, if we are comparing
different models with different sets of variables by the PC criterion, we cannot
make the “optimistic” assumption that every one of these is the true model.
This is one of the major problems with Amemiya’s PC criterion. It is more
reasonable to assume that $\hat{\sigma}_m^2$ is the appropriate estimate of $\sigma^2$ as assumed in
Mallows’ $C_p$ criterion.


34 C. L. Mallows, “Some Comments on $C_p$,” Technometrics, November 1973, pp. 661-676. The
criterion was first suggested by Mallows in 1964 in an unpublished paper.

35 R. R. Hocking, “The Analysis and Selection of Variables in Multiple Regression,” Biometrics,
March 1976, pp. 1-49. This has been further discussed in two papers by M. L. Thompson,
“Selection of Variables in Multiple Regression, Part 1: A Review and Evaluation,” and “Part
2: Chosen Procedures, Computations and Examples,” International Statistical Review, Vol.
46, 1978, pp. 1-19 and pp. 129-146. Also see U. Hjorth, “Model Selection and Forward
Validation,” Scandinavian Journal of Statistics, Vol. 9, 1982, pp. 1-49.

36 T. Amemiya, “Selection of Regressors,” International Economic Review, Vol. 21, 1980, pp.
331-354.

37 We will not be concerned with the proof here. It can be found in Amemiya, “Selection of
Regressors.”

The important thing to note in the discussion above is that the regressors in
the sample period are assumed nonstochastic, whereas they are assumed
stochastic in the prediction period. Hocking, by contrast, assumes the regressors
to have a multivariate normal distribution and derives the $S_p$ criterion given in
Table 12.3 by minimizing the unconditional MSE of prediction. Kinal and
Lahiri 39 show that in the stochastic regressor case Amemiya’s PC and Mallows’
$C_p$ both reduce to Hocking’s $S_p$ criterion. Breiman and Freedman 40 give an
alternative justification for Hocking’s $S_p$ criterion. It is not necessary for us to go
through the different ways in which the criteria $C_p$ and $S_p$ have been derived
and justified. This discussion can be found in the papers by Thompson cited
earlier. The important thing to note is that Amemiya’s PC and Mallows’ $C_p$
depend on the assumption of nonstochastic regressors, whereas Hocking’s $S_p$
depends on the assumption of stochastic regressors. A consequence of this
assumption is that (as noted by Kinal and Lahiri) both the $C_p$ and PC criteria
require an estimate of the variance $\sigma^2$ of the disturbance in the true model,
whereas $S_p$ does not. It just requires an estimate of the variance of the
disturbance term in the restricted model.

One more important thing to note is that the maximum $\bar{R}^2$ criterion and the
predictive criteria $C_p$, $S_p$, and PC answer two different questions. In the case
of the maximum $\bar{R}^2$ criterion, what we are trying to do is pick the “true” model,
assuming that one of the models considered is the “true” one. In the case of
the prediction criteria, we are interested in “parsimony” and we would like to
omit some of the regressors (even if a model that includes them is the true
model) if this improves the MSE of prediction. For the latter problem, the
question is whether we need to assume the existence of a true model or not. For
the $C_p$ and PC criteria, as we have seen, we do need to think in terms of a
“true” model. For the $S_p$ criterion (or in the stochastic regressor case), we do
not have to worry about whether one of the models is the true model or not
and what regressors are there in the “true” model.


38 This is in fact the basis for the maximum $\bar{R}^2$ rule.

39 T. Kinal and K. Lahiri, “A Note on ‘Selection of Regressors,’” International Economic
Review, Vol. 25, No. 3, October 1984.

40 L. Breiman and D. Freedman, “How Many Variables Should Be Entered in a Regression
Equation?” Journal of the American Statistical Association, March 1983, pp. 131-136.
Akaike’s Information Criterion

Akaike’s information criterion 41 (AIC) is a more general criterion that can be
applied to any model that can be estimated by the method of maximum
likelihood. It suggests minimizing

$$\frac{-2 \log L}{n} + \frac{2k}{n}$$

where $k$ is the number of parameters in $L$. For the regression models this
criterion implies minimizing [RSS exp$(2k/n)$], which is the criterion listed in Table
12.3.

We have included AIC in our list because it is the one that is commonly used
(at least in nonlinear models) and like the other criteria listed in Table 12.3, it
involves $\mathrm{RSS}_j$ and $k_j$ only. Thus it can be computed using the standard
regression programs.
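Since every criterion in Table 12.3 depends only on $\mathrm{RSS}_j$, $k_j$, $n$, and $\hat{\sigma}_m^2$, the comparison can be sketched in a few lines. The following is a minimal illustration, not a definitive implementation: the function name `selection_criteria`, the dictionary keys, and all the numbers are hypothetical and chosen only for demonstration.

```python
import math


def selection_criteria(rss_j, k_j, n, sigma2_m):
    """Quantities minimized in Table 12.3 for one candidate model.

    rss_j    : residual sum of squares of model j
    k_j      : number of explanatory variables in model j
    n        : sample size
    sigma2_m : RSS/(n - k) from the model containing all k regressors
    """
    return {
        "Theil": rss_j / (n - k_j),
        "Sp": rss_j / ((n - k_j) * (n - k_j - 1)),
        "Cp": rss_j + 2 * k_j * sigma2_m,
        "PC": rss_j * (n + k_j) / (n - k_j),
        "AIC": rss_j * math.exp(2 * (k_j + 1) / n),
    }


# Hypothetical numbers, for illustration only.
n = 50
full_rss, k = 100.0, 5
sigma2_m = full_rss / (n - k)
candidates = {"full (k=5)": (100.0, 5), "restricted (k=3)": (104.0, 3)}

table = {
    name: selection_criteria(rss, k_j, n, sigma2_m)
    for name, (rss, k_j) in candidates.items()
}

# For each criterion, report the candidate with the smallest value.
for crit in ["Theil", "Sp", "Cp", "PC", "AIC"]:
    best = min(table, key=lambda name: table[name][crit])
    print(crit, "->", best)
```

Each criterion is evaluated for every candidate, and the model with the smallest value is preferred under that criterion.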


12.8 Implied F-Ratios for the 
Various Criteria 


We saw earlier (Section 4.10) that maximizing $\bar{R}^2$ implies deletion of variables
with an F-ratio < 1. We can derive similar conditions for the $C_p$, $S_p$, and PC
criteria discussed in Section 12.7.

Consider the restricted model with $k_1$ ($< k$) variables, so that $k_2$ variables are
deleted ($k = k_1 + k_2$). Define $\lambda = \mathrm{RRSS}/\mathrm{URSS}$, where RRSS and URSS are
the residual sums of squares from the restricted and unrestricted models,
respectively.

Then, as derived in Section 4.8, the F-ratio for testing the hypothesis that
the coefficients of the $k_2$ excluded variables are zero is

$$F = \frac{(\mathrm{RRSS} - \mathrm{URSS})/k_2}{\mathrm{URSS}/(n - k)} = \frac{(\lambda - 1)(n - k)}{k_2} \tag{12.4}$$

If the restricted model is chosen, then for the $\bar{R}^2$ criterion, this means that

$$\frac{\mathrm{RRSS}}{n - k_1} < \frac{\mathrm{URSS}}{n - k} \quad \text{or} \quad \lambda < \frac{n - k_1}{n - k} \tag{12.5}$$

Substituting in (12.4) and noting that $k = k_1 + k_2$, we get the condition $F < 1$.
For Amemiya’s PC criterion we have

$$\lambda < \frac{(n - k_1)(n + k)}{(n + k_1)(n - k)} \tag{12.6}$$

Substituting this in (12.4) and simplifying, we get

$$F < \frac{2n}{n + k_1} \tag{12.7}$$

For the $C_p$ criterion we have

$$\lambda < 1 + \frac{2k_2}{n - k} \tag{12.8}$$

and thus $F < 2$. For the $S_p$ criterion

$$\lambda < \frac{(n - k_1)(n - k_1 - 1)}{(n - k)(n - k - 1)} \tag{12.9}$$

and thus

$$F < 2 + \frac{k_2 + 1}{n - k - 1} \tag{12.10}$$

For the AIC criterion

$$\lambda < \frac{\exp(k/n)}{\exp(k_1/n)} \tag{12.11}$$

Substituting this in (12.4), we do not get any easy expression as in the other
cases, but we can tabulate the values for different values of $k/n$ and $k_1/n$.

For $n$ large relative to $k$, we have

$$\exp\left(\frac{k}{n}\right) \simeq 1 + \frac{k}{n}$$

and hence

$$\lambda < \frac{n + k}{n + k_1} \tag{12.12}$$

Substituting this in (12.4), we get

$$F < \frac{n - k}{n + k_1} \quad \text{which is} < 1 \tag{12.13}$$

The F-values for deletion of the $k_2$ variables implied by the different criteria
are shown in Table 12.4. They stand in the following relation:

$$F(\mathrm{AIC}) < F(\bar{R}^2) = 1 < F(\mathrm{PC}) < F(C_p) = 2 < F(S_p) \tag{12.14}$$

If $n$ is large relative to $k$, note that PC, $C_p$, and $S_p$ all imply a cutoff value of
F = 2.


41 This criterion is derived in H. Akaike, “Information Theory and an Extension of the
Maximum Likelihood Principle,” in B. N. Petrov and F. Csaki (eds.), 2nd International Symposium
on Information Theory (Budapest: Akadémiai Kiadó, 1973), and H. Akaike, “On Entropy
Maximization Principle,” in P. R. Krishnaiah (ed.), Applications of Statistics (Amsterdam:
North-Holland, 1977).
Table 12.4 F-Values for Different Criteria



Criterion                        Choose the Restricted Model if the F-Ratio
                                 for Testing the Restrictions Gives:

Maximum $\bar{R}^2$              $F < 1$
Mallows’ $C_p$                   $F < 2$
Amemiya’s PC                     $F < \dfrac{2n}{n + k_1}$
Hocking’s $S_p$                  $F < 2 + \dfrac{k_2 + 1}{n - k - 1}$
Akaike’s AIC (for $n$            $F < \dfrac{n - k}{n + k_1}$
  large relative to $k$)


The choice between the restricted and unrestricted models is based on an F- 
test for the restrictions. The use of the F-ratios in Table 12.4 implies two things: 

1. The significance level that we use for the F-tests, for testing the restric¬ 
tions, is not the conventional 5% level of significance. In fact, we use a 
much higher level of significance. 

2. The significance level used, in general, decreases with the sample size.
This is particularly true for the cases where we use a constant F-ratio like
1 or 2, irrespective of the sample size. It is also true of Amemiya’s PC
criterion, where the F-ratio is less than 2 for small samples and
approaches 2 as $n \to \infty$. However, with Hocking’s $S_p$ criterion, the cutoff
point of F declines toward 2 as the sample size increases. Thus in this
case we cannot say unambiguously that the significance level decreases
with sample size.
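The behavior of the cutoff points as the sample size grows can be checked numerically. The sketch below evaluates the four exact expressions in Table 12.4 (the large-sample AIC approximation is left out); the function name `f_cutoffs` and the chosen values of $n$, $k$, and $k_2$ are illustrative assumptions, not part of the text.

```python
def f_cutoffs(n, k, k2):
    """Critical F-ratios of Table 12.4 when k2 of the k regressors are dropped."""
    k1 = k - k2  # regressors kept in the restricted model
    return {
        "max adjusted R2": 1.0,
        "Cp": 2.0,
        "PC": 2.0 * n / (n + k1),
        "Sp": 2.0 + (k2 + 1) / (n - k - 1),
    }


# Hypothetical design: k = 5 regressors, k2 = 2 of them candidates for deletion.
for n in (20, 100, 1000):
    cutoffs = f_cutoffs(n, k=5, k2=2)
    print(n, {name: round(value, 3) for name, value in cutoffs.items()})
```

The PC cutoff rises toward 2 and the $S_p$ cutoff falls toward 2 as $n$ increases, which is the pattern described in point 2.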

The common procedure of using a constant level of significance in hypothesis 
testing irrespective of sample size has been often criticized on grounds that 
with a sufficiently large sample every null hypothesis can be rejected. The pro¬ 
cedure increasingly distorts the interpretation of data against a null hypothesis 
as the sample size grows. The significance level should, consequently, be a 
decreasing function of sample size. 42 As we have seen, the criteria for choice 
of regressors imply a decreasing significance level as the sample size increases. 
However, some Bayesian arguments have led to more substantial changes in 
the significance levels with sample size than are implied by the criteria in Table 
12.4. We will discuss one such criterion, that of Leamer, 43 but before that we
will explain briefly what the Bayesian approach is.


42 This argument was made in D. V. Lindley, “A Statistical Paradox,” Biometrika, Vol. 44, 1957, 
pp. 187-192, and since then in many Bayesian papers on model selection. 

43 Leamer, Specification Searches, pp. 114-116. 
Bayes’ Theorem and Posterior Odds
for Model Selection 


Bayes’ theorem is based on the definition of conditional probability. Let $E_1$ and
$E_2$ be two events. Then by the definition of conditional probability, we have

$$P(E_1 \mid E_2) = \frac{P(E_1 E_2)}{P(E_2)} \quad \text{and} \quad P(E_2 \mid E_1) = \frac{P(E_1 E_2)}{P(E_1)}$$

Hence

$$P(E_1 \mid E_2) = \frac{P(E_2 \mid E_1) P(E_1)}{P(E_2)}$$

Now substitute $H$ (hypothesis about the model that generated the data) for $E_1$
and $D$ (observed data) for $E_2$. Then we have

$$P(H \mid D) = \frac{P(D \mid H) P(H)}{P(D)} \tag{12.15}$$


Here $P(D \mid H)$ is the probability of observing the data given that $H$ is true. This
is usually called the likelihood. $P(H)$ is our probability that $H$ is true before
observing the data (usually called the prior probability). $P(H \mid D)$ is the
probability that $H$ is true after observing the data (usually called the posterior
probability). $P(D)$ is the unconditional probability of observing the data (whether $H$
is true or not). Often $P(D)$ is difficult to compute. Hence we write the relation
above as

$$P(H \mid D) \propto P(D \mid H) P(H)$$


That is: Posterior probability varies with (or is proportional to) likelihood times 
prior probability. 

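If we have two hypotheses $H_1$ and $H_2$, then

$$P(H_1 \mid D) = \frac{P(D \mid H_1) P(H_1)}{P(D)} \quad \text{and} \quad P(H_2 \mid D) = \frac{P(D \mid H_2) P(H_2)}{P(D)}$$

Hence

$$\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \frac{P(D \mid H_1)}{P(D \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)} \tag{12.16}$$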


The left-hand side is called posterior odds. The first term on the right-hand side 
is called the likelihood ratio, and the second term on the right-hand side is 
called the prior odds. Thus we have: 


Posterior odds equals likelihood ratio times prior odds. 
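
As a small numerical illustration of this rule, the sketch below computes posterior odds from two likelihood values and two prior probabilities. The function name `posterior_odds` and all the numbers are hypothetical; converting the odds into a posterior probability additionally assumes that $H_1$ and $H_2$ are the only two hypotheses under consideration.

```python
def posterior_odds(lik_h1, lik_h2, prior_h1, prior_h2):
    """Posterior odds of H1 against H2 = likelihood ratio times prior odds."""
    likelihood_ratio = lik_h1 / lik_h2
    prior_odds = prior_h1 / prior_h2
    return likelihood_ratio * prior_odds


# Hypothetical values: the data are four times as likely under H1 as under H2,
# and the two hypotheses are given equal prior probability.
odds = posterior_odds(lik_h1=0.02, lik_h2=0.005, prior_h1=0.5, prior_h2=0.5)

# If H1 and H2 exhaust the possibilities, odds convert to a probability.
prob_h1 = odds / (1.0 + odds)
print(round(odds, 3), round(prob_h1, 3))
```

With equal prior probabilities the prior odds are 1, so the posterior odds coincide with the likelihood ratio.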

In the problem of choice of regressors, $H_1$ and $H_2$ involve some parameters,
say $\beta$ and $\gamma$. The likelihood ratio is computed by a weighting procedure, the
weights being determined by the prior distributions of these parameter sets.
Thus we have

$$\frac{P(H_1 \mid \text{data})}{P(H_2 \mid \text{data})} = \frac{\int L_1 P_1(\beta)\, d\beta}{\int L_2 P_2(\gamma)\, d\gamma} \cdot \frac{P(H_1)}{P(H_2)} \tag{12.17}$$

where $L_1$ and $L_2$ are the respective likelihoods and $P_1$ and $P_2$ are the respective
prior distributions for the parameters in models 1 and 2. The first term on the
right-hand side is sometimes called the Bayes factor.

There have been many suggestions in the literature for $P_1$ and $P_2$ and thus
for the computation of Bayesian posterior odds. 44 Since our purpose here is to
illustrate how the implied F-ratio changes with the sample size according to
some Bayesian arguments, we will present one formula suggested by Leamer.
He suggests that the posterior probabilities should satisfy the following
properties:


1. There must be no arbitrary constants. 

2. The posterior probability of a model should be invariant to linear trans¬ 
formations of the data. 

3. There should be a degrees of freedom adjustment: of two models that 
both yield the same residual sum of squares, the one with the fewer ex¬ 
planatory variables should have the higher posterior probability. 


Based on these criteria, Leamer suggests prior distributions $P_1$ and $P_2$ and
computes the posterior odds as prior odds times the Bayes factor given by (in
the notation we are using)

$$B = \left(\frac{\mathrm{RRSS}}{\mathrm{URSS}}\right)^{n/2} n^{-k_2/2} \tag{12.18}$$

We say that the evidence favors the restricted model if $B < 1$. Using the F-ratio
defined in (12.4), we get the condition as

$$F < \frac{n - k}{k_2}\left(n^{k_2/n} - 1\right) \tag{12.19}$$

Compared with the F-ratios presented in Table 12.4, this criterion produces
large changes in the F-ratios as the sample size $n$ changes. The F-ratios
Leamer’s criterion implies are presented in Table 12.5.
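
The critical values in (12.19) are easy to evaluate directly. The following sketch tabulates them over the same grid of $n - k$, $k$, and $k_2$ that Table 12.5 uses; the function name `leamer_critical_f` is an illustrative choice, and the formula is taken as stated in (12.19).

```python
def leamer_critical_f(n_minus_k, k, k2):
    """Critical F-ratio from (12.19): ((n - k)/k2) * (n**(k2/n) - 1)."""
    n = n_minus_k + k
    return (n_minus_k / k2) * (n ** (k2 / n) - 1.0)


# Tabulate over the grid of degrees of freedom, k, and k2 used in Table 12.5.
for k2 in (1, 3, 5):
    for k in (1, 3, 5, 10):
        row = [round(leamer_critical_f(d, k, k2), 2) for d in (5, 10, 50, 100)]
        print("k2 =", k2, "k =", k, row)
```

For $k_2 = 1$, $k = 1$, and $n - k = 5$ the function gives about 1.74, and the value rises steadily as $n - k$ increases, in contrast to the nearly constant cutoffs of Table 12.4.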


12.9 Cross-Validation 


One of the major criticisms of the different criteria for selection of regressors
that we discussed in the preceding section is that a model we choose may be
the best for the period used in estimation but may not be the best when we use
it for prediction in future periods. To handle this criticism, it is often suggested
that we use only part of the data for the purpose of estimation and save the
rest for the purpose of prediction and to check the adequacy of the model
chosen. We estimate the different models using the first part and then use the
estimated parameters to generate predictions for the second part. The model that
minimizes the sum of squared prediction errors is then chosen as the best
model.


44 An early survey is in K. M. Gaver and M. S. Geisel, “Discriminating Among Alternative
Models: Bayesian and Non-Bayesian Methods,” in P. Zarembka (ed.), Frontiers of
Econometrics (New York: Academic Press, 1974).


Table 12.5 Critical F-Values Implied by Bayesian Posterior Odds Criterion

                                        n - k
                               5       10       50      100

$k_2 = 1$   k = 1             1.74    2.44     4.01    4.68
            k = 3             1.48    2.18     3.89    4.60
            k = 5             1.29    1.98     3.78    4.53
            k = 10            0.99    1.62     3.53    4.37
            5% point of F     6.60    4.96     4.03    3.94

$k_2 = 3$   k = 1             2.42    3.08     4.34    4.90
            k = 3             1.97    2.69     4.20    4.82
            k = 5             1.66    2.40     4.07    4.74
            k = 10            1.20    1.89     3.79    4.56
            5% point of F     5.41    3.71     2.79    2.70

$k_2 = 5$   k = 1             3.45    3.95     4.70    5.13
            k = 3             2.67    3.36     4.54    5.05
            k = 5             2.16    2.93     4.40    4.96
            k = 10            1.47    2.23     4.07    4.76
            5% point of F     5.05    3.33     2.40    2.30

Source: E. E. Leamer, Specification Searches, Wiley, New York, 1978, p. 116.

This procedure of splitting the data into two parts—one for estimation and 
the other for prediction—is called cross-validation. Actually, we have two sets 
of prediction errors. The prediction errors for the first part (estimation period) 
are known as within sample prediction errors. The sum of squares of these 
prediction errors is the usual residual sum of squares. The prediction errors 
from the second part are known as the out-of-sample prediction errors. Differ¬ 
ent criteria in cross-validation depend on different weights given to the sums 
of squares of these two sets of prediction errors. A criterion often used is to 
give equal weights to these two sets of prediction errors. 

What the cross-validation procedure does is to impose a penalty for param¬ 
eter instability. If the parameters are not stable between the estimation period 
and prediction period, the sum-of-squared out-of-sample prediction errors will 
be large even if the sum of squares of within-sample prediction errors is small. 
Thus the procedure of model selection by cross-validation implies choosing the 
model that minimizes: the usual residual sum of squares plus a penalty for pa¬ 
rameter instability. 
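
The split-sample procedure can be sketched for the simplest case of one regressor plus an intercept. Everything in the block below is a hypothetical illustration: the helper names, the data, and the choice of equal weights (`weight_out=1.0`) are assumptions made for the example, and the two candidate regressors are constructed so that the second one fits well in the estimation period but breaks down afterward.

```python
def fit_line(x, y):
    """OLS intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sse(x, y, intercept, slope):
    """Sum of squared prediction errors for given coefficients."""
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def split_sample_score(x, y, n_est, weight_out=1.0):
    """Within-sample RSS plus weighted out-of-sample sum of squares."""
    a, b = fit_line(x[:n_est], y[:n_est])
    within = sse(x[:n_est], y[:n_est], a, b)
    out_of_sample = sse(x[n_est:], y[n_est:], a, b)
    return within + weight_out * out_of_sample, within, out_of_sample


# Hypothetical data: x1 tracks y throughout; x2 stops moving after period 6.
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.2, -0.05, 0.0, 0.1, -0.1]
x1 = [float(t) for t in range(1, 11)]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0]
y = [2.0 * xi + e for xi, e in zip(x1, noise)]

score1, within1, out1 = split_sample_score(x1, y, n_est=6)
score2, within2, out2 = split_sample_score(x2, y, n_est=6)
print("model with x1:", round(score1, 3), " model with x2:", round(score2, 3))
```

Both candidates have the same within-sample residual sum of squares here, so the choice is driven entirely by the out-of-sample term, which is the penalty for instability described above.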

Instead of splitting the data into two sets and using one for estimation and
the other for prediction, we can follow the procedure of using one observation
at a time for prediction (and the remaining observations for estimation). That
is, in fact, the idea behind “predicted residuals” discussed earlier in Section
12.4. There the sum of squares of the predicted residuals, PRESS, was
suggested as a criterion for model choice. Note that at each stage $(n - 1)$
observations are within sample (i.e., used for estimation) and the remaining
observation is out-of-sample (i.e., used for prediction). PRESS is the sum of squares
of out-of-sample prediction errors.
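
A direct, if inefficient, way to compute PRESS is to refit the model $n$ times, leaving out one observation each time. The sketch below does this for a single regressor with an intercept; the function names and the data are hypothetical and serve only to illustrate the definition.

```python
def fit_line(x, y):
    """OLS intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(x, y):
    """Sum of squared predicted residuals, each from a fit omitting that point."""
    total = 0.0
    for i in range(len(x)):
        x_rest = x[:i] + x[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        a, b = fit_line(x_rest, y_rest)
        total += (y[i] - a - b * x[i]) ** 2
    return total


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

a_full, b_full = fit_line(x, y)
rss = sum((yi - a_full - b_full * xi) ** 2 for xi, yi in zip(x, y))
press_value = press(x, y)
print(round(rss, 4), round(press_value, 4))
```

PRESS is never smaller than the ordinary residual sum of squares, because each predicted residual comes from a fit that did not use the observation being predicted.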

Instead of using PRESS (the sum of squares of predicted residuals) as a
criterion for model choice, we can consider the sum of squares of studentized
residuals as a criterion of model choice. 45 As discussed in Section 12.4, the
studentized residual is just the predicted residual divided by its standard error.
The sum of squares of studentized residuals can be denoted by SSSR and the
sum of squares of predicted residuals as SSPR. This would be a better
terminology, in keeping with the use of the term “residual” for an “estimated error.”
From the derivations in Section 12.3 we note that both SSPR and SSSR are
weighted sums of squares of the least squares residuals $\hat{u}_i$. The $\bar{R}^2$ criterion also
involves minimizing a weighted sum of squares of least squares residuals (we
minimize $\sum \hat{u}_i^2/\text{d.f.}$). To be a valid criterion of model choice we should be able
to show that the expected value of the quantity minimized is less for the true
model than for the alternative models. If this is not the case, the criterion does
not consistently select the true model. Leamer 46 shows that the $\bar{R}^2$ criterion,
and the SSSR criterion, which minimizes the sum of squares of studentized
residuals (this is Schmidt’s SSPE criterion), satisfy this test and are valid criteria,
but that the PRESS criterion and other criteria suggested in cross-validation
are not valid by this test. Thus if one is interested in using predicted residuals
for model choice, the best procedure appears to be not one of splitting the
sample into two parts but to derive studentized residuals (the SAS regression
program gives them) and consider minimization of the sum of squares of
studentized residuals (SSSR) as a criterion of model choice (i.e., use Schmidt’s
SSPE criterion).


12.10 Hausman’s Specification Error Test 


Hausman’s specification error test 47 is a general and widely used test for testing 
the hypothesis of no misspecification in the model. 

45 This is the criterion proposed in Peter Schmidt, “Choosing Among Alternative Linear
Regression Models,” Atlantic Economic Journal, 1974, pp. 7-13.

46 E. E. Leamer, “Model Choice and Specification Analysis,” in Z. Griliches and M. D.
Intriligator (eds.), Handbook of Econometrics, Vol. I (Amsterdam: North-Holland, 1983), Chap. 5,
pp. 285-330.

47 J. A. Hausman, “Specification Tests in Econometrics,” Econometrica, Vol. 46, No. 6,
November 1978, pp. 1251-1271.

Let $H_0$ denote the null hypothesis that there is no misspecification and let $H_1$
denote the alternative hypothesis that there is a misspecification (of a particular
type). For instance, if we consider the regression model

$$y = \beta x + u \tag{12.20}$$

in order to use the OLS procedure, we specify that $x$ is independent of $u$. Thus
the null and alternative hypotheses are:

$H_0$: $x$ and $u$ are independent

$H_1$: $x$ and $u$ are not independent

To implement Hausman’s test, we have to construct two estimators $\hat{\beta}_0$ and $\hat{\beta}_1$,
which have the following properties:

$\hat{\beta}_0$ is consistent and efficient under $H_0$ but is not consistent under $H_1$.

$\hat{\beta}_1$ is consistent under both $H_0$ and $H_1$ but is not efficient under $H_0$.

Then we consider the difference $\hat{q} = \hat{\beta}_1 - \hat{\beta}_0$. Hausman first shows that

$$\mathrm{var}(\hat{q}) = V_1 - V_0$$

where $V_1 = \mathrm{var}(\hat{\beta}_1)$ and $V_0 = \mathrm{var}(\hat{\beta}_0)$, both variances being computed under
$H_0$. Let $\hat{V}(\hat{q})$ be a consistent estimate of $\mathrm{var}(\hat{q})$. Then we use

$$m = \frac{\hat{q}^2}{\hat{V}(\hat{q})}$$

as a $\chi^2$ with 1 d.f. to test $H_0$ against $H_1$. This is an asymptotic test.

We have considered only a single parameter $\beta$. In the general case where $\beta$
is a vector of $k$ parameters, $V_1$ and $V_0$ will be matrices, $\hat{\beta}_1$, $\hat{\beta}_0$, and $\hat{q}$ will all be
vectors and the Hausman test statistic is

$$m = \hat{q}'[\hat{V}(\hat{q})]^{-1}\hat{q}$$

which has (asymptotically) a $\chi^2$-distribution with degrees of freedom $k$.

Since a consideration of the $k$-parameter case involves vectors and matrices,
we discuss the single-parameter case. The derivations in the $k$-parameter case
are all similar.

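To prove the result $\mathrm{var}(\hat{q}) = V_1 - V_0$, we first have to prove the result that

$$\mathrm{cov}(\hat{\beta}_0, \hat{q}) = 0$$

The proof proceeds as follows. Under $H_0$, both $\hat{\beta}_0$ and $\hat{\beta}_1$ are consistent
estimates for $\beta$. Hence we get

$$\mathrm{plim}\, \hat{q} = \mathrm{plim}\, \hat{\beta}_1 - \mathrm{plim}\, \hat{\beta}_0 = \beta - \beta = 0$$

Consider a new estimator for $\beta$ defined by

$$\hat{\delta} = \hat{\beta}_0 + \lambda \hat{q}$$

where $\lambda$ is any constant. Then $\mathrm{plim}\, \hat{\delta} = \beta$. Thus $\hat{\delta}$ is a consistent estimator of
$\beta$ for all values of $\lambda$. Its variance satisfies

$$\mathrm{var}(\hat{\delta}) = V_0 + \lambda^2\, \mathrm{var}(\hat{q}) + 2\lambda\, \mathrm{cov}(\hat{\beta}_0, \hat{q}) \ge V_0$$

since $\hat{\beta}_0$ is efficient. Thus

$$\lambda^2\, \mathrm{var}(\hat{q}) + 2\lambda\, \mathrm{cov}(\hat{\beta}_0, \hat{q}) \ge 0 \tag{12.21}$$

for all values of $\lambda$. We will show that the relationship (12.21) can be satisfied
for all values of $\lambda$ only if $\mathrm{cov}(\hat{\beta}_0, \hat{q}) = 0$.

Suppose that $\mathrm{cov}(\hat{\beta}_0, \hat{q}) > 0$. Then by choosing $\lambda$ negative and equal to
$-\mathrm{cov}(\hat{\beta}_0, \hat{q})/\mathrm{var}(\hat{q})$, we can show that the relationship (12.21) is violated: with
this choice the left-hand side of (12.21) equals $-[\mathrm{cov}(\hat{\beta}_0, \hat{q})]^2/\mathrm{var}(\hat{q})$, which is
negative. Thus $\mathrm{cov}(\hat{\beta}_0, \hat{q})$ is not greater than zero.

Similarly, suppose that $\mathrm{cov}(\hat{\beta}_0, \hat{q}) < 0$. Then by choosing $\lambda$ positive and
equal to $-\mathrm{cov}(\hat{\beta}_0, \hat{q})/\mathrm{var}(\hat{q})$, we show that the relationship (12.21) is violated.
Thus $\mathrm{cov}(\hat{\beta}_0, \hat{q})$ cannot be greater than or less than zero. Hence we get
$\mathrm{cov}(\hat{\beta}_0, \hat{q}) = 0$.

Now since $\hat{\beta}_1 = \hat{\beta}_0 + \hat{q}$ and $\mathrm{cov}(\hat{\beta}_0, \hat{q}) = 0$, we get

$$\mathrm{var}(\hat{\beta}_1) = \mathrm{var}(\hat{\beta}_0) + \mathrm{var}(\hat{q})$$

or

$$\mathrm{var}(\hat{q}) = \mathrm{var}(\hat{\beta}_1) - \mathrm{var}(\hat{\beta}_0) = V_1 - V_0$$

which is the result on which Hausman’s test is based.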


An Application: Testing for Errors in 
Variables or Exogeneity 


Consider now the model given by equation (12.20). The model can be regarded 
as an errors-in-variables model (Chapter 11), where x is correlated with the 
error term u because it is an error-ridden explanatory variable. In this case our 
interest is in testing whether there is an error in this variable or not. 

Alternatively, equation (12.20) can be regarded as one equation in a simul¬ 
taneous equations model (Chapter 9) and x is correlated with u because it is an 
endogenous variable. We are interested in testing whether x is exogenous or 
endogenous. If x is not correlated with u, we are justified in estimating the 
equation by OLS. 

Under $H_1$, $\hat{\beta}_0$ is not consistent. To get a consistent estimator for $\beta$ we have
to use the instrumental variable (IV) method. Let us denote the instrumental
variable by $z$. Then the IV estimator is

$$\hat{\beta}_1 = \frac{\sum yz}{\sum xz} = \beta + \frac{\sum uz}{\sum xz}$$

$$V_1 = \mathrm{var}(\hat{\beta}_1) = \sigma^2 \frac{\sum z^2}{(\sum xz)^2}$$

$\hat{\beta}_1$ is consistent under both $H_0$ and $H_1$. It is, however, less efficient than $\hat{\beta}_0$
under $H_0$.

Defining $\hat{q} = \hat{\beta}_1 - \hat{\beta}_0$, we have

$$\mathrm{var}(\hat{q}) = V_1 - V_0 = \sigma^2\left[\frac{\sum z^2}{(\sum xz)^2} - \frac{1}{\sum x^2}\right] = V_0\, \frac{1 - r^2}{r^2}$$

where $r^2$ is the squared correlation between $x$ and the instrumental variable $z$. 48
The test statistic is

$$m = \frac{\hat{q}^2 r^2}{(1 - r^2)\hat{V}_0}$$

which we use as $\chi^2$ with 1 d.f.


Some Illustrative Examples 


As an illustration, let us consider the data on the Australian wine industry in
Chapter 9 (Table 9.2). Let us consider the equation (all variables in log form)

$$Q_t = \alpha + \beta p_t^w + u_t$$

We are interested in testing the hypothesis that $p^w$ is exogenous. The OLS
estimator of $\beta$ is

$$\hat{\beta}_0 = \frac{1.78037}{0.507434} = 3.5085$$

Using the index of storage costs $s$ as an instrumental variable, we get the IV
estimator as

$$\hat{\beta}_1 = \frac{2.75474}{0.500484} = 5.4862$$

$$\hat{q} = \hat{\beta}_1 - \hat{\beta}_0 = 1.9777$$

The squared correlation coefficient between $p^w$ and $s$ is $r^2 = 0.408$. We also
have

$$\hat{\sigma}^2 = 0.09217$$

and

$$\hat{V}_0 = \frac{0.09217}{0.507434} = 0.18164$$

The test statistic is

$$m = \frac{\hat{q}^2 r^2}{(1 - r^2)\hat{V}_0} = 14.84$$

This is significant at the 1% level (from the $\chi^2$-tables with 1 d.f.). Thus we
cannot treat $p^w$ as exogenous and we have to consider a simultaneous equations
model.
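
The arithmetic of this example can be reproduced with a few lines of code. The sketch below plugs the estimates reported above into the single-parameter statistic; the function name `hausman_m` and the constant name are illustrative, and 6.635 is the standard 1% critical value of $\chi^2$ with 1 d.f.

```python
def hausman_m(beta_ols, beta_iv, r2, v0):
    """Single-parameter Hausman statistic m = q^2 r^2 / ((1 - r^2) V0)."""
    q = beta_iv - beta_ols
    return q * q * r2 / ((1.0 - r2) * v0)


# Figures reported in the text for the wine example (storage costs as instrument).
v0 = 0.09217 / 0.507434            # estimated variance of the OLS estimator
m_storage = hausman_m(beta_ols=3.5085, beta_iv=5.4862, r2=0.408, v0=v0)

CHI2_1DF_1PCT = 6.635              # 1% critical value, chi-square with 1 d.f.
print(round(m_storage, 2), m_storage > CHI2_1DF_1PCT)
```

The computed statistic agrees with the value of 14.84 reported above and exceeds the 1% critical value, so exogeneity of $p^w$ is rejected.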


48 Note that both $V_1$ and $V_0$ depend on $\sigma^2$. But when we get an estimate of $\mathrm{var}(\hat{q})$, the estimate
of $\sigma^2$ we use is the one obtained under $H_0$.

Let us see what happens if we use per capita disposable income $y_t$ as the
instrumental variable. In this case we have

$$\hat{\beta}_1 = \frac{2.215}{0.4592} = 4.8236 \qquad r^2 = 0.624$$

$$\hat{q} = 4.8236 - 3.5085 = 1.3151$$

The test statistic is now

$$m = \frac{(1.3151)^2(0.624)}{0.376(0.18164)} \simeq 15.80$$

Again, this is significant at the 1% level, indicating that we have to treat $p^w$ as
endogenous.

As yet another example, consider the demand and supply model for pork
considered in Chapter 9 (data in Table 9.1). Let us consider the hypothesis that
$Q$ is exogenous in the demand function (normalized with respect to $P$).

$$P = \alpha + \beta Q + \gamma Y + u$$

The OLS estimate of $\beta$ is

$$\hat{\beta} = -1.2518 \quad \text{with SE} = 0.1032$$

The 2SLS estimate of $\beta$ is

$$\tilde{\beta} = -1.2165$$

Thus

$$\hat{q} = \tilde{\beta} - \hat{\beta} = 0.0353$$

$$\hat{V}_0 = (0.1032)^2 = 0.01065$$

The instrumental variable that is implied by the use of 2SLS is $\hat{Q}$, the predicted
value of $Q$ from the reduced form. The squared correlation between $Q$ and $\hat{Q}$
is the $R^2$ from the reduced-form equation for $Q$, which is 0.898. Hence the test
statistic is

$$m = \frac{(0.0353)^2(0.898)}{(1 - 0.898)(0.01065)} = 1.03$$

which is not significant even at the 25% level (from the $\chi^2$-tables with 1 d.f.).

Thus we can treat $Q$ as exogenous and estimate the demand equation by
OLS. Note that this was our conclusion in Section 9.3.

An Omitted Variable Interpretation 
of the Hausman Test 

There is an alternative way of implementing the Hausman test. Returning to
equation (12.20), let $\hat{x}$ be the predicted value of $x$ from a regression of $x$ on the
instrumental variable $z$, and $\hat{v}$ the estimated residual. That is, $\hat{v} = x - \hat{x}$. Then
estimate by OLS the equation

$$y = \beta x + \gamma \hat{x} + \varepsilon$$

or

$$y = \beta x + \gamma \hat{v} + \varepsilon$$

and test the hypothesis $\gamma = 0$. These equations are known as expanded
regressions. This test is identically the same as the test based on the statistic $m$
derived earlier.

We will not prove this result 49 but we will explicitly state what the OLS
estimate $\hat{\gamma}$ will turn out to be. We have

$$\hat{\gamma} = \frac{\hat{\beta}_1 - \hat{\beta}_0}{1 - r^2} = \frac{\hat{q}}{1 - r^2}$$

where $r^2$ is the squared correlation between $x$ and $z$. Also,

$$\mathrm{var}(\hat{\gamma}) = \frac{V_0}{r^2(1 - r^2)}$$

Thus

$$\frac{\hat{\gamma}^2}{\mathrm{var}(\hat{\gamma})} = \frac{\hat{q}^2 r^2}{(1 - r^2)V_0}$$

the statistic obtained earlier.
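
The stated expression for $\hat{\gamma}$ can be checked numerically. The sketch below simulates hypothetical data for model (12.20) without an intercept, computes the OLS and IV estimates, and then runs the expanded regression of $y$ on $x$ and $\hat{x}$ by solving the two normal equations directly. The simulated design, the seed, and the variable names are assumptions made for illustration; $r^2$ is computed from raw cross-products because the model has no intercept.

```python
import random

random.seed(0)
n = 200
# Hypothetical data-generating process, for illustration only.
z = [random.gauss(0.0, 1.0) for _ in range(n)]
x = [0.8 * zi + random.gauss(0.0, 1.0) for zi in z]
y = [1.5 * xi + random.gauss(0.0, 1.0) for xi in x]


def dot(a, b):
    """Sum of cross-products of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


beta0 = dot(x, y) / dot(x, x)                     # OLS estimate
beta1 = dot(z, y) / dot(z, x)                     # IV estimate with instrument z
r2 = dot(x, z) ** 2 / (dot(x, x) * dot(z, z))     # squared correlation of x and z

# Predicted value of x from the regression of x on z.
pi_hat = dot(x, z) / dot(z, z)
x_hat = [pi_hat * zi for zi in z]

# Expanded regression of y on (x, x_hat): solve the 2x2 normal equations.
a11, a12, a22 = dot(x, x), dot(x, x_hat), dot(x_hat, x_hat)
b1, b2 = dot(x, y), dot(x_hat, y)
det = a11 * a22 - a12 * a12
gamma_hat = (a11 * b2 - a12 * b1) / det

print(round(gamma_hat, 6), round((beta1 - beta0) / (1.0 - r2), 6))
```

The two printed numbers coincide up to rounding error, which is the identity $\hat{\gamma} = \hat{q}/(1 - r^2)$.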

This omitted variable interpretation enables us to generalize the test to 
the case where we have more variables whose exogeneity we are interested in 
testing. 

Suppose that one of the equations in a simultaneous equations model is

$$y_1 = \beta_2 y_2 + \beta_3 y_3 + \alpha_1 z_1 + u_1 \tag{12.22}$$

We want to test the hypothesis that $y_2$ and $y_3$ are exogenous (i.e., independent
of $u_1$), the alternative hypothesis being that they are not. The variable $z_1$ is
considered exogenous.

To apply the test, we need to have at least two instrumental variables, say $z_2$
and $z_3$. But there could be more in the system. Then what we do is regress $y_2$
and $y_3$ on the other exogenous variables in the system and obtain the predicted
values $\hat{y}_2$ and $\hat{y}_3$. Now we estimate the expanded regression equation

$$y_1 = \beta_2 y_2 + \beta_3 y_3 + \alpha_1 z_1 + \gamma_2 \hat{y}_2 + \gamma_3 \hat{y}_3 + \varepsilon \tag{12.23}$$

by ordinary least squares and test the hypothesis $\gamma_2 = \gamma_3 = 0$ (by the methods
described in Chapter 4). This is the Hausman test for exogeneity.

Let us define $\hat{v}_2 = y_2 - \hat{y}_2$ and $\hat{v}_3 = y_3 - \hat{y}_3$. Then, instead of (12.23), we can
also estimate

$$y_1 = \beta_2 y_2 + \beta_3 y_3 + \alpha_1 z_1 + \gamma_2 \hat{v}_2 + \gamma_3 \hat{v}_3 + \varepsilon \tag{12.24}$$

by OLS and test the hypothesis $\gamma_2 = \gamma_3 = 0$.

This procedure applies only if we are testing the exogeneity of all the
variables under question and does not apply for testing the exogeneity of a subset
of these variables. For instance, in equation (12.22) suppose we assume that $y_2$
is endogenous and $z_1$ is exogenous. What we want to do is test the hypothesis
that $y_3$ is exogenous. Now we cannot just estimate equations like (12.23) and
(12.24) and just test $\gamma_3 = 0$. Nor does the equivalence between (12.23) and
(12.24) apply anymore. 50


49 For a proof, see Hausman, “Specification Tests,” p. 1261.

In this case we have to obtain the 2SLS estimates of $\beta_3$ under two
assumptions:


1. $y_3$ is exogenous. Call this estimate $\hat{\beta}_3$ with variance $V_0$.

2. $y_3$ is endogenous. Call this estimate $\tilde{\beta}_3$ with variance $V_1$.


Then $\hat{q} = \tilde{\beta}_3 - \hat{\beta}_3$ and $\mathrm{var}(\hat{q}) = V_1 - V_0$. We now apply the Hausman test
as explained earlier. Note that both $V_1$ and $V_0$ depend on $\sigma^2 = \mathrm{var}(u_1)$.
However, when we get an estimate of $\mathrm{var}(\hat{q})$ we have to use the estimate of $\sigma^2$ under
$H_0$, that is, from the 2SLS estimation treating $y_3$ as exogenous.

The omitted variable method in this case is somewhat complicated, but
consists of the following steps:

1. First we get the reduced-form residuals for the endogenous variables $y_2$
and $y_3$ as before. Call these residuals $\hat{v}_2$ and $\hat{v}_3$. However, we do not use
these as the omitted variables. We construct linear combinations of these
as explained in the next step.

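2. Form the covariance matrix of these residuals and get its inverse. Denote
this by

$\begin{pmatrix} C^{22} & C^{23} \\ C^{32} & C^{33} \end{pmatrix}$

In our simple case we have

$C^{22} = \dfrac{\sum \hat{v}_3^2}{\Delta}, \qquad C^{33} = \dfrac{\sum \hat{v}_2^2}{\Delta}, \qquad C^{23} = C^{32} = -\dfrac{\sum \hat{v}_2 \hat{v}_3}{\Delta}$

where $\Delta = \sum \hat{v}_2^2 \sum \hat{v}_3^2 - \left(\sum \hat{v}_2 \hat{v}_3\right)^2$.

3. Define

$x_2 = C^{22} \hat{v}_2 + C^{23} \hat{v}_3$
$x_3 = C^{32} \hat{v}_2 + C^{33} \hat{v}_3$

These are the missing variables we use.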

4. Estimate the following equation by OLS

$y_1 = \beta_2 y_2 + \beta_3 y_3 + \alpha_1 z_1 + \gamma_2 x_2 + \gamma_3 x_3 + \varepsilon$

and test the hypothesis $\gamma_3 = 0$. This is the required test for the hypothesis
that $y_3$ is exogenous.

50 This is pointed out in Alberto Holly, “A Simple Procedure for Testing Whether a Subset of
Endogenous Variables Is Independent of the Disturbance Term in a Structural Equation,”
discussion paper, University of Lausanne, November 1982.

The test is actually not very complicated. In fact, step 1 has to be used in 
any 2SLS estimation. Only step 2 involves matrix inversion. Once this is done, 
step 3 is easy and step 4 is just OLS estimation. 
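Steps 2 and 3 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the residual values are made up, and the matrix inverted is the 2 x 2 matrix of sums of squares and cross products of the two residual series, as in the simple case described above.

```python
def inverse_cross_products(v2, v3):
    """Elements of the inverse of [[sum v2^2, sum v2*v3], [sum v2*v3, sum v3^2]]."""
    s22 = sum(a * a for a in v2)
    s33 = sum(b * b for b in v3)
    s23 = sum(a * b for a, b in zip(v2, v3))
    delta = s22 * s33 - s23 ** 2            # determinant
    return s33 / delta, -s23 / delta, s22 / delta   # C22, C23 (= C32), C33

def constructed_regressors(v2, v3):
    """Linear combinations of the residuals used as the added variables (step 3)."""
    c22, c23, c33 = inverse_cross_products(v2, v3)
    x2 = [c22 * a + c23 * b for a, b in zip(v2, v3)]
    x3 = [c23 * a + c33 * b for a, b in zip(v2, v3)]
    return x2, x3

# Made-up residuals, purely for illustration.
v2 = [1.0, -2.0, 0.5, 1.5]
v3 = [0.5, 1.0, -1.0, 2.0]
x2, x3 = constructed_regressors(v2, v3)
print(x2, x3)
```

A useful check on the arithmetic is that the cross products of the constructed variables with the original residuals form an identity matrix, which follows directly from the definition of the inverse.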

12.11 The Plosser-Schwert-White Differencing Test

The Plosser-Schwert-White (PSW) differencing test⁵¹ is, like the Hausman
test, a general test for model misspecification, but is applicable to time-series
data only. The test involves estimation of the regression models in levels and
in first differences. If the model is correctly specified, the estimators from the
differenced and undifferenced models have the same probability limits and
hence the results should corroborate one another. On the other hand, if there
are specification errors, the differenced regression should lead to different
results. The PSW test, like the Hausman test, is based on a comparison of the
estimates from the differenced and undifferenced regressions.

Davidson, Godfrey, and MacKinnon⁵² show that, like the Hausman test, the
PSW test is equivalent to a much simpler omitted variables test, the omitted
variables being the sum of the lagged and one-period forward values of the
variables.

Thus if the regression equation we are considering is 

$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + u_t$

the PSW test involves estimating the expanded regression equation

$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \gamma_1 z_{1t} + \gamma_2 z_{2t} + u_t$

where

$z_{1t} = x_{1,t+1} + x_{1,t-1}$
$z_{2t} = x_{2,t+1} + x_{2,t-1}$

and testing the hypothesis $\gamma_1 = \gamma_2 = 0$ by the usual F-test.
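The omitted-variable form of the differencing test can be sketched as follows. This is an illustrative sketch on simulated data, with hypothetical helper names (`ols`, `lead_lag_sum`, `psw_f_stat`); it assumes two regressors and no constant term, as in the equations above, and drops the first and last observations because the lead and lag are not available there.

```python
import random

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, residual sum of squares)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):                      # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    rss = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return b, rss

def lead_lag_sum(x):
    """z_t = x_{t+1} + x_{t-1} for the interior observations only."""
    return [x[t + 1] + x[t - 1] for t in range(1, len(x) - 1)]

def psw_f_stat(y, x1, x2):
    """F-statistic for gamma_1 = gamma_2 = 0 in the expanded regression."""
    z1, z2 = lead_lag_sum(x1), lead_lag_sum(x2)
    yy, a, b = y[1:-1], x1[1:-1], x2[1:-1]  # align with the interior observations
    _, rss_r = ols([list(r) for r in zip(a, b)], yy)
    _, rss_u = ols([list(r) for r in zip(a, b, z1, z2)], yy)
    df = len(yy) - 4                        # four coefficients in the expanded model
    return ((rss_r - rss_u) / 2) / (rss_u / df)

rng = random.Random(1)                      # simulated, correctly specified model
T = 120
x1 = [rng.gauss(0, 1) for _ in range(T)]
x2 = [rng.gauss(0, 1) for _ in range(T)]
y = [1.0 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
F = psw_f_stat(y, x1, x2)
print(F)
```

The resulting statistic would be compared with the F distribution with 2 and T - 6 degrees of freedom in this setup.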

If there are lagged dependent variables in the equation, the test needs a minor 
modification. Suppose that the model is 

$y_t = \beta_1 y_{t-1} + \beta_2 x_t + u_t$

Now the omitted variables would be defined as

$z_{1t} = y_t + y_{t-2}$

and

$z_{2t} = x_{t+1} + x_{t-1}$

There is no problem with $z_{2t}$, but $z_{1t}$ would be correlated with the error term $u_t$
because of the presence of $y_t$ in it. The solution would be simply to transfer it
to the left-hand side and write the expanded regression equation as

$(1 - \gamma_1) y_t = \beta_1 y_{t-1} + \beta_2 x_t + \gamma_1 y_{t-2} + \gamma_2 z_{2t} + u_t$

This equation can be written as

$y_t = \beta_1^* y_{t-1} + \beta_2^* x_t + \gamma_1^* y_{t-2} + \gamma_2^* z_{2t} + u_t^*$

where all the starred parameters are the corresponding unstarred ones divided
by $(1 - \gamma_1)$.

The PSW test now tests the hypothesis $\gamma_1^* = \gamma_2^* = 0$. Thus, in the case where
the model involves the lagged dependent variable $y_{t-1}$ as an explanatory
variable, the only modification needed is that we should use $y_{t-2}$ as the omitted
variable, not $(y_t + y_{t-2})$. For the other explanatory variables, the
corresponding omitted variables are defined as before. Note that it is only $y_{t-1}$ that creates
a problem, not higher-order lags of $y_t$, like $y_{t-2}$, $y_{t-3}$, and so on.

51 C. I. Plosser, G. W. Schwert, and H. White, “Differencing as a Test of Specification,”
International Economic Review, October 1982, pp. 535-552.

52 R. Davidson, L. G. Godfrey, and J. G. MacKinnon, “A Simplified Version of the Differencing
Test,” International Economic Review, October 1985, pp. 639-647.

12.12 Tests for Nonnested Hypotheses 


Consider the problem of testing two hypotheses: 

$H_0$: $y = \beta x + u_0$, $u_0 \sim IN(0, \sigma_0^2)$ (12.25)

$H_1$: $y = \gamma z + u_1$, $u_1 \sim IN(0, \sigma_1^2)$ (12.26)

The hypotheses are said to be nonnested since the explanatory variables under 
one of the hypotheses are not a subset of the explanatory variables in the other. 

It is a common occurrence in economics that there are many competing
economic theories trying to explain the same variable (consumption, investment,
etc.), and the explanatory variables in the different theories contain
nonoverlapping variables. An extreme example is the paper by Friedman and
Meiselman⁵³ around which there was a considerable amount of controversy in
the 1960s. The Keynesian and monetarist theories were formulated in their
simplified form as

$C_t = \alpha_0 + \beta_0 A_t + u_{0t}$ (Keynesian)

$C_t = \alpha_1 + \beta_1 M_t + u_{1t}$ (Monetarist)

where $C_t$ = consumption expenditure in constant dollars
$A_t$ = autonomous expenditure in constant dollars
$M_t$ = money supply

53 M. Friedman and D. Meiselman, “The Relative Stability of Monetary Velocity and the
Investment Multiplier in the U.S. 1897-1958,” in Stabilization Policies (Commission on Money
and Credit) (Englewood Cliffs, N.J.: Prentice Hall, 1963).

The Davidson and MacKinnon Test

Several tests have been suggested for testing such nonnested
hypotheses.⁵⁴ The first such test is the Cox test. However, we will
outline some simple alternative and asymptotically equivalent tests suggested by
Davidson and MacKinnon⁵⁵ that are easy to apply. Their procedure for
testing $H_0$ against $H_1$ ($H_0$ is the maintained hypothesis) is as follows:

1. Estimate (12.26) by OLS. Let $\hat{y}_1 = \hat{\gamma} z$ be the predicted value of $y$.

2. Estimate the regression equation

$y = \beta x + \alpha \hat{y}_1 + u$ (12.27)

and test the hypothesis $\alpha = 0$.

If the hypothesis is not rejected, then $H_0$ is not rejected by $H_1$. If the
hypothesis is rejected, then $H_0$ is rejected by $H_1$.

A test of $H_1$ against $H_0$ would be based on analogous steps.

1. Estimate (12.25) by OLS. Let $\hat{y}_0 = \hat{\beta} x$ be the predicted value of $y$.

2. Estimate the regression equation

$y = \gamma z + \delta \hat{y}_0 + v$ (12.28)

and test the hypothesis $\delta = 0$.

If the hypothesis is not rejected, then $H_1$ is not rejected by $H_0$. If the
hypothesis is rejected, then $H_1$ is rejected by $H_0$.
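The two-step procedure just described can be sketched as follows. This is a minimal illustration on simulated data, not the authors' code: the helper names (`ols`, `fitted`, `j_stat`) and the data-generating process are assumptions, and the t-ratio on the added fitted values is obtained from the fact that, for a single restriction, the squared t-ratio equals the F-statistic.

```python
import math
import random

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, residual sum of squares)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):                      # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    rss = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return b, rss

def fitted(X, y):
    """Fitted values from regressing y on X."""
    b, _ = ols(X, y)
    return [sum(bj * xj for bj, xj in zip(b, r)) for r in X]

def j_stat(y, X, Z):
    """t-ratio on the rival model's fitted values when added to y = X b + u."""
    y_rival = fitted(Z, y)                      # step 1: fit the rival model
    _, rss_r = ols(X, y)                        # maintained model alone
    XA = [r + [f] for r, f in zip(X, y_rival)]  # step 2: add the fitted values
    b, rss_u = ols(XA, y)
    df = len(y) - len(XA[0])
    F = (rss_r - rss_u) / (rss_u / df)          # one restriction, so F = t^2
    return math.copysign(math.sqrt(max(F, 0.0)), b[-1])

rng = random.Random(2)
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
z = [0.5 * a + rng.gauss(0, 1) for a in x]
y = [2.0 * a + rng.gauss(0, 1) for a in x]      # data generated under H0
X, Z = [[a] for a in x], [[a] for a in z]
j_h0_vs_h1 = j_stat(y, X, Z)    # tests alpha = 0 in (12.27)
j_h1_vs_h0 = j_stat(y, Z, X)    # tests delta = 0 in (12.28)
print(j_h0_vs_h1, j_h1_vs_h0)
```

Each statistic would be compared with standard normal critical values. Since the data here are generated under $H_0$, one would expect the second statistic to be far larger in absolute value than the first.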

The outcome of these two tests can be summarized as follows:

                          Hypothesis: $\alpha = 0$
Hypothesis: $\delta = 0$      Not rejected                  Rejected
Not rejected              Both $H_0$ and $H_1$              $H_1$ is acceptable,
                          are acceptable                $H_0$ is not
Rejected                  $H_0$ is acceptable,            Neither $H_0$ nor $H_1$
                          $H_1$ is not                    is acceptable

The motivation behind the tests suggested by Davidson and MacKinnon is
the following: The models given by $H_0$ and $H_1$ can be combined into a single
model

$y = (1 - \alpha)\beta x + \alpha(\gamma z) + u$ (12.29)

in which we test $\alpha = 0$ versus $\alpha = 1$. However, there is no way of estimating $\beta$, $\gamma$,
and $\alpha$ from this model. All we get are estimates of $(1 - \alpha)\beta$ and $\alpha\gamma$. Davidson
and MacKinnon show that we can substitute $\hat{y}_1 = \hat{\gamma} z$ for $\gamma z$ in (12.29) and then
test $\alpha = 0$. They show that under $H_0$, the t-ratio of $\hat{\alpha}$ from (12.27) is asymptotically N(0, 1).
They call this the J-test (because $\alpha$ and $\beta$ are estimated jointly). Note that the
J-tests are one-degree-of-freedom tests irrespective of the number of
explanatory variables in $H_0$ and $H_1$.

54 Two surveys of this extensive literature are: J. G. MacKinnon, “Model Specification Tests
Against Non-nested Alternatives” (with discussion), Econometric Reviews, Vol. 2, 1983, pp.
85-158, and M. McAleer and M. H. Pesaran, “Statistical Inference in Non-nested Econometric
Models,” Applied Mathematics and Computation, 1986.

55 R. Davidson and J. G. MacKinnon, “Several Tests for Model Specification in the Presence of
Alternative Hypotheses,” Econometrica, Vol. 49, 1981, pp. 781-793. The survey paper by
MacKinnon contains additional references to the papers by Davidson and MacKinnon.

The relationship between the J-test and optimum combination of forecasts is
as follows: We have two forecasts of $y$ from the two models (12.25) and (12.26).
Call these $\hat{y}_0$ and $\hat{y}_1$, respectively. The two forecasts can be combined to
produce a forecast with a smaller forecast error.⁵⁶ The combination can be written
as

$\hat{y} = (1 - \alpha)\hat{y}_0 + \alpha \hat{y}_1$

The value of $\alpha$ that gives the minimum forecast error is

$\alpha = \dfrac{\lambda_1^2 - \rho \lambda_1 \lambda_2}{\lambda_1^2 + \lambda_2^2 - 2\rho \lambda_1 \lambda_2}$

$\lambda_1^2$ and $\lambda_2^2$ are, respectively, the variances of the forecast errors $(y - \hat{y}_0)$ and
$(y - \hat{y}_1)$, and $\rho$ is the correlation between these forecast errors. Also,

$\alpha \geq 0$ if and only if $\dfrac{\lambda_1}{\lambda_2} \geq \rho$
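The weight formula is easy to evaluate and to check against the variance it is meant to minimize. The sketch below is illustrative only; the function names are hypothetical, and the inputs `lam1` and `lam2` are taken to be the standard deviations of the forecast errors $(y - \hat{y}_0)$ and $(y - \hat{y}_1)$, with `rho` their correlation.

```python
def optimal_weight(lam1, lam2, rho):
    """Weight alpha on the second forecast that minimizes the combined error variance."""
    return (lam1 ** 2 - rho * lam1 * lam2) / (lam1 ** 2 + lam2 ** 2 - 2 * rho * lam1 * lam2)

def combined_variance(alpha, lam1, lam2, rho):
    """Variance of (1 - alpha) * e0 + alpha * e1 for forecast errors e0 and e1."""
    return ((1 - alpha) ** 2 * lam1 ** 2 + alpha ** 2 * lam2 ** 2
            + 2 * alpha * (1 - alpha) * rho * lam1 * lam2)

# Equal error variances and uncorrelated errors: the forecasts get equal weight.
print(optimal_weight(1.0, 1.0, 0.0))

# A less accurate first forecast shifts weight toward the second one.
a_star = optimal_weight(2.0, 1.0, 0.3)
print(a_star, combined_variance(a_star, 2.0, 1.0, 0.3))
```

In the boundary case where the ratio of the two standard deviations equals the correlation, the numerator vanishes and the second forecast receives zero weight, in line with the condition stated above.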

Corresponding to the optimum combination of forecasts, equation (12.29)
can also be written as

$y = (1 - \alpha)\hat{y}_0 + \alpha \hat{y}_1 + v$

where $v$ is the error term. This can also be written as

$y - \hat{y}_0 = \alpha(\hat{y}_1 - \hat{y}_0) + v$

If we estimate this equation and the estimate of $\alpha$ is not significantly different
from zero, we can say that $H_1$ does not add anything to explaining $y$ over $H_0$.
If $\alpha$ is significantly different from zero, $H_1$ explains $y$ over and above $H_0$.

We will not pursue the extensions of the J-test here.⁵⁷ We will, however,
discuss a few of the limitations.⁵⁸ One limitation of the test is that sometimes it
rejects both $H_0$ and $H_1$ or accepts both $H_0$ and $H_1$. Although this conclusion suggests
that we should go back and examine both models, in many cases we would like
to have some ranking of the models.

An alternative procedure is to embed the two models given by $H_0$ and $H_1$
into a comprehensive model

$y = \beta x + \gamma z + u$ (12.30)

and test $H_0$ by testing $\gamma = 0$ and $H_1$ by testing $\beta = 0$. When there
is more than one variable in $x$ and $z$, these tests will be F-tests. This is in
contrast to the J-test, which is a one-degree-of-freedom test whatever the number
of explanatory variables in $H_0$ and $H_1$.

56 See C. W. J. Granger and P. Newbold, Forecasting Economic Time Series, 2nd ed. (New York:
Academic Press, 1986), pp. 266-267 on combination of forecasts.

57 McAleer and Pesaran, “Statistical Inference.” G. R. Fisher and M. McAleer, “Alternative
Procedures and Associated Tests of Significance for Non-nested Hypotheses,” Journal of
Econometrics, Vol. 16, 1981, pp. 103-119, suggest what is known as the JA test. Unlike the
J-test, which is asymptotic, the JA test is a small-sample test (if x and z are fixed in repeated
samples).

58 A good review and discussion of the limitations of nonnested tests is L. G. Godfrey, “On the
Use of Misspecification Checks and Tests of Non-nested Hypotheses in Empirical
Econometrics,” Economic Journal, Supplement, 1985, pp. 69-81.

The Encompassing Test 

At first sight it would appear that there is no relationship between the F-tests
and the J-test. This is not so. Mizon and Richard⁵⁹ suggest a more general test
called the “encompassing test” of which the F-test and J-test are special cases.
The encompassing principle is based on the idea that a model builder should
analyze whether his model can account for salient features of rival models.⁶⁰
The encompassing test is a formulation of this principle. If your model is
specified by $H_0$ and the rival model by $H_1$, a formal test of $H_0$ against $H_1$ is to
compare $\hat{\gamma}$ and $\hat{\sigma}_1^2$, obtained under $H_1$ from equation (12.26), with the
probability limits of these parameters under your hypothesis $H_0$. Comparing
$\hat{\gamma}$ with plim $\hat{\gamma} \mid H_0$ gives the mean encompassing test. Comparing $\hat{\sigma}_1^2$
with plim $\hat{\sigma}_1^2 \mid H_0$ gives the variance encompassing test. Mizon and Richard
show that the F-test is a mean encompassing test and the J-test is a variance
encompassing test. This explains why the J-test is a one-degree-of-freedom test
no matter how many explanatory variables there are in the models given by $H_0$
and $H_1$. The complete encompassing test (CET) suggested by Mizon and
Richard is a joint test that compares $\hat{\gamma}$ and $\hat{\sigma}_1^2$ with their probability limits under
$H_0$. A discussion of this joint test is beyond the scope of this book. Further,
there is not much empirical evidence on it. The F-test and J-test, on the other
hand, can be very easily implemented.

In summary, to test the nonnested hypothesis $H_0$ against $H_1$, we need to
apply two tests:

1. The J-test, testing $\alpha = 0$ based on equation (12.27).

2. The F-test, testing $\gamma = 0$ in the comprehensive model (12.30).

As to how these tests perform in small samples, some experimental evidence
suggests that the one-degree-of-freedom J-test is better than the F-test at
rejecting the false model. However, this is no consolation because it has also been
found that the J-test tends to reject the true model too frequently, with the
estimated small-sample significance levels being so large that it is often difficult
to justify asymptotically valid critical values.⁶¹ Yet another shortcoming of the
nonnested tests is that if both $H_0$ and $H_1$ are false, these tests are inferior to
standard diagnostic tests (discussed in earlier sections of this chapter).⁶²

What all this suggests is that in testing nonnested hypotheses, one should
use the J-test with higher significance levels and supplement it with the F-test
on the comprehensive model and standard diagnostic tests.

59 G. E. Mizon and J. F. Richard, “The Encompassing Principle and Its Application to Testing
Non-nested Hypotheses,” Econometrica, Vol. 54, 1986, pp. 657-678.

60 J. E. H. Davidson and D. F. Hendry, “Interpreting Econometric Evidence: The Behavior of
Consumers’ Expenditures in the U.K.,” European Economic Review, Vol. 16, 1981, pp. 179-198.


A Basic Problem in Testing Nonnested Hypotheses 

The specification implied by equations (12.25) and (12.26) does not by itself
enable us to make a valid comparison of the two models. In fact, this is the
problem with testing nonnested models. Equation (12.25) specifies the
conditional distribution of $y$ given $x$. Similarly, equation (12.26) specifies the
conditional distribution of $y$ given $z$. How can we compare these two conditional
distributions unless the role of $z$ under $H_0$ and the role of $x$ under $H_1$ are specified?
What we are given are two noncomparable conditional distributions. $H_0$ does
not specify anything about the relationship between $y$ and $z$. $H_1$ does not
specify anything about the relationship between $y$ and $x$. Thus to compare the two
models given by (12.25) and (12.26), we should be able to derive the conditional
distributions $f(y \mid x)$ and $g(y \mid z)$ under both $H_0$ and $H_1$. To do this we have
to supplement equations (12.25) and (12.26) by more equations. This is what is
done by Mizon and Richard.

61 Godfrey, “On the Use,” p. 76.

62 Godfrey, “On the Use,” p. 79. Godfrey quotes N. R. Erickson’s Ph.D. thesis (1982) at the
London School of Economics, which provides a numerical example in which F and
one-degree-of-freedom tests reject only one of the two competing models, but misspecification tests
decisively reject $H_0$, $H_1$, and the comprehensive model.


Summary


1. Diagnostic tests are tests that “diagnose” specific problems with the
model we are estimating. The least squares residuals can be used to diagnose
the problems of heteroskedasticity, autocorrelation, omission of relevant
variables, and ARCH effects in errors (see Section 12.2). Tests for
heteroskedasticity and autocorrelation have been discussed in Chapters 5 and 6,
respectively. To test for omitted variables, we first estimate the equation by OLS and
use $\hat{y}_t^2$ and higher powers of $\hat{y}_t$ as additional variables, reestimate the equation,
and test that the coefficients of these added variables are zero. A test based on
a regression of $\hat{u}_t$, the estimated residual, on $\hat{y}_t^2$ and higher powers of $\hat{y}_t$ is not a
valid test. To test for ARCH effects, we estimate the regression

$\hat{u}_t^2 = \gamma_0 + \gamma_1 \hat{u}_{t-1}^2 + e_t$

and test $\gamma_1 = 0$.
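The ARCH regression in item 1 can be sketched as below. This is an illustration under stated assumptions: the helper names are hypothetical, the series passed in is simulated rather than a set of genuine OLS residuals, and the hypothesis on the slope is tested with an F-statistic (the square of the usual t-ratio for a single restriction).

```python
import random

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, residual sum of squares)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):                      # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    rss = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return b, rss

def arch_f_stat(resid):
    """F-statistic for a zero slope in the regression of u_t^2 on a constant and u_{t-1}^2."""
    sq = [r * r for r in resid]
    dep, lag = sq[1:], sq[:-1]
    _, rss_r = ols([[1.0] for _ in dep], dep)        # constant only
    _, rss_u = ols([[1.0, l] for l in lag], dep)     # constant and lagged square
    df = len(dep) - 2
    return (rss_r - rss_u) / (rss_u / df)

# Simulated series with an ARCH(1) structure, standing in for OLS residuals.
rng = random.Random(3)
u = [0.0]
for _ in range(300):
    u.append(rng.gauss(0, 1) * (0.5 + 0.5 * u[-1] ** 2) ** 0.5)
resid = u[1:]
F = arch_f_stat(resid)
print(F)
```

Because both sides of the regression are squared residuals, the statistic does not change if the residuals are rescaled by a constant.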

2. There are two problems with the least squares residuals. They are
heteroskedastic and autocorrelated even if the true errors are independent and
have a common variance $\sigma^2$ (see Section 12.3). To solve these problems,
some alternative residuals have been suggested (see Section 12.4). Two such
residuals, which are uncorrelated and have a common variance, are the BLUS
residuals and recursive residuals. The BLUS residuals are more difficult to
compute than the recursive residuals; moreover, in tests of heteroskedasticity
and autocorrelation, their performance is not as good as that of least squares
residuals. Hence we consider only recursive residuals. Two other types of
residuals are the predicted residuals and studentized residuals. Both of these are,
like the least squares residuals, heteroskedastic and autocorrelated. However,
some statisticians have found the predicted residuals useful in choosing
between regression models and the studentized residuals in the detection of
outliers. The predicted residuals, studentized residuals, and recursive residuals
can all be computed using the dummy variable method.

3. The usual approach to outliers based on least squares residuals involves
looking at the OLS residuals, deleting the observations with large residuals,
and reestimating the equation. In recent years an alternative has
been suggested, based on a criterion called DFFITS, which measures
the change in the fitted value $\hat{y}$ of $y$ that results from dropping a particular
observation. Further, we do not drop outliers. We merely see that their
influence is minimized. This is what is known as bounded influence estimation
(Section 12.5). The SAS regression program computes DFFITS.

4. According to Leamer, the process of model building involves several
types of specification searches. These can be classified under the headings:
hypothesis-testing search, interpretive search, simplification search, proxy
variable search, data selection search, and post-data model construction.

5. Hendry argues that most empirical econometric work starts with very
simplified models and that not enough diagnostic tests are applied to check
whether something is wrong with the maintained model. His suggested strategy
is to start with a very general model and then progressively simplify it by
applying some data-based simplification tests. (The arguments for this are
discussed in Section 12.6.)

6. A special problem in model selection that we often encounter is that of
selection of regressors. Several criteria have been suggested in the literature:
the maximum $\bar{R}^2$ criterion and the predictive criteria PC, $C_p$, and $S_p$. These
criteria are summarized in Section 12.7. There is, however, some difference in
these criteria. In the case of the maximum $\bar{R}^2$ criterion what we are trying to
do is pick the “true” model assuming that one of the models considered is the
“true” one. In the case of prediction criteria, we are interested in parsimony
and we would like to omit some of the regressors if this improves the MSE of
prediction. For this problem the question is whether or not we need to assume
the existence of a true model. For the PC and $C_p$ criteria we need to assume
the existence of a true model. For the $S_p$ criterion we do not. For this reason
the $S_p$ criterion appears to be the most preferable among these criteria.

7. In Section 12.8 we discuss the critical F-ratios implied by the different
criteria for selection of regressors. We show that they stand in the relation

$F(\bar{R}^2) < 1 < F(PC) < F(C_p) < 2 < F(S_p)$

We also present the F-ratios implied by Leamer’s posterior odds analysis. This
method implies that the critical F-ratios used should be higher for higher
sample sizes. For most sample sizes frequently encountered, the implied F-ratios are
greater than 2, which is another argument against the $\bar{R}^2$ criterion and the PC and $C_p$ criteria.

8. Cross-validation methods are often used to choose between different
models. These methods depend on splitting the data into two parts: one for
estimation and the other for prediction. The model that minimizes the sum of
squared prediction errors is chosen as the best model. A better procedure than
splitting the data into two parts is to leave out one observation at a time for
prediction, derive the studentized residuals (the SAS regression program gives
these), and use the sum of squares of the studentized residuals as a criterion of
model choice.

9. A general test for specification errors that is easy to use is the Hausman
test. Let $H_0$ be the hypothesis of no specification error and $H_1$ the alternative
hypothesis that there is a specification error (of a particular type). We compute
two estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ of the parameter $\beta$. $\hat{\beta}_0$ is consistent and efficient under
$H_0$ but is not consistent under $H_1$. $\hat{\beta}_1$ is consistent under both $H_0$ and $H_1$ but is
not efficient under $H_0$. If $d = \hat{\beta}_1 - \hat{\beta}_0$, then $V(d) = V(\hat{\beta}_1) - V(\hat{\beta}_0)$.
Hausman’s test depends on comparing $d$ with $V(d)$. If we are interested in testing
for errors in variables or for exogeneity, Hausman’s test reduces to a test of
omitted variables.

10. Another general specification test (applicable in time-series models) is
the PSW differencing test. In this test we compare the estimators from levels
and first difference equations. The idea is that if the model is correctly
specified, the estimates we get should not be far apart. This test can also be
viewed as an omitted-variable test (Section 12.11).

11. Very often, the comparison of two economic theories implies the testing
of two nonnested hypotheses. The earliest test for this is the Cox test, but the
alternative (and asymptotically equivalent) J-test is easier to apply. It can also
be viewed as an omitted-variable test. There are, however, some conceptual
problems with nonnested tests. The two hypotheses $H_0$ and $H_1$ specify two
conditional distributions with different conditioning variables [e.g., $f(y \mid x)$ and
$g(y \mid z)$] and it does not really make sense to compare them. A proper way is
to derive the conditional distributions $f(y \mid x)$ and $g(y \mid z)$ under both $H_0$ and
$H_1$ and compare them, that is, we have to have the same conditioning variables
under both the hypotheses. This is the idea behind the encompassing test. What
it amounts to is supplementing the J-test with an F-test based on a
comprehensive model.


Exercises 


1. Explain the meaning of each of the following terms. 

(a) Least squares residual. 

(b) Predicted residual. 

(c) Studentized residual. 

(d) Recursive residual. 

(e) BLUS residual. 

(f) DFFITS. 

(g) Bounded influence estimation. 

(h) ARCH errors. 

(i) Post-data model construction. 

(j) Overparametrization with data-based simplification. 

(k) Posterior odds ratio. 

(l) Cross validation. 

(m) Selection of regressors. 

(n) Diagnostic checking. 

(o) Specification testing. 

(p) Hausman’s test. 

(q) J-test.

(r) Encompassing test. 

(s) Differencing test. 

2. Examine whether the following statements are true, false, or uncertain. 
Give a short explanation. If a statement is not true in general but is true 
under some conditions, state the conditions. 

(a) Since the least squares residuals are heteroskedastic and
autocorrelated, even when the true errors are independent and homoskedastic,
tests for homoskedasticity and serial independence based on least
squares residuals are useless.

(b) There is a better chance of detecting outliers with predicted residuals 
or studentized residuals than with least squares residuals. 

(c) In tests of significance we should not use a constant level of
significance. The significance level should be a decreasing function of the
sample size.

(d) To test for omitted variables, all we have to do is get some proxies
for them, regress the least squares residual $\hat{u}_t$ on the proxies, and test
whether the coefficients of these proxies are zero.

(e) The procedure of estimating the regression equation

$\hat{u}_t^2 = \gamma_0 + \gamma_1 \hat{u}_{t-1}^2 + e_t$

where $\hat{u}_t$ are the least squares residuals, does not necessarily test for
ARCH effect.

(f) The $\bar{R}^2$ criterion of deciding whether to retain or delete certain
regressors in an equation implies using a 5% significance level for
testing that the coefficients of these regressors are zero.

(g) The criteria for selection of regressors proposed by Amemiya,
Hocking, and Mallows imply using a higher significance level for testing
the coefficients of regressors than the $\bar{R}^2$ criterion.

(h) The $\bar{R}^2$, $C_p$, and PC criteria for selection of regressors depend on the
assumption that one of the models considered is a “true” model,
whereas the $S_p$ criterion does not. Hence given the choice between
these criteria, we should use only the $S_p$ criterion.

(i) The maximum $\bar{R}^2$ criterion and the predictive criteria PC, $C_p$, and $S_p$
for selection of regressors are not really comparable since they
answer different questions.

(j) For comparing different models we should always save part of the 
sample for prediction purposes. 

(k) In the model

$y_1 = \beta_2 y_2 + \beta_3 y_3 + \gamma_1 x_1 + u_1$

where $x_1$ is exogenous, to test whether $y_2$ and $y_3$ are endogenous, we
estimate the equation

$y_1 = \beta_2 y_2 + \beta_3 y_3 + \gamma_1 x_1 + \alpha_2 \hat{v}_2 + \alpha_3 \hat{v}_3 + w$

($\hat{v}_2$ and $\hat{v}_3$ are the estimated residuals from the reduced form equations
for $y_2$ and $y_3$).

(l) In the model in part (k), if $x_1$ is specified as exogenous and $y_2$ as
endogenous, to test whether $y_3$ is endogenous, we estimate the
equation

$y_1 = \beta_2 y_2 + \beta_3 y_3 + \gamma_1 x_1 + \alpha_3 \hat{v}_3 + w$

and test $\alpha_3 = 0$.

3. Comment on the following: In a study on the demand for money, an
economist regressed real cash balances on permanent income. To test whether
the interest rate has been omitted, she regressed the residual from this
regression on the interest rate and found that the coefficient was not
significantly different from zero. She therefore concluded that the interest rate
does not belong in the demand for money function.

4. Explain how each of the following tests can be considered as a test for the 
coefficients of some extra variables added to the original equation. 

(a) Test for serial correlation. 

(b) Test for omitted variables. 

(c) Test for errors in variables. 

(d) Test for exogeneity. 

(e) J-test.

(f) PSW differencing test.

In each case explain how you would interpret the results of the test and the
actions you would take.


Appendix to Chapter 12 


(1) Least Squares Residuals

We have $\hat{y} = X(X'X)^{-1}X'y = Hy$. $H = (h_{ij})$ is known as the hat matrix. The
least squares residuals are $\hat{u} = (I - H)y = (I - H)u$. This gives equation (12.1).
$\mathrm{Var}(\hat{u}) = (I - H)\sigma^2$ since $I - H$ is idempotent. Hence $\mathrm{var}(\hat{u}_i) = (1 - h_{ii})\sigma^2$,
which is equation (12.3), and $\mathrm{cov}(\hat{u}_i, \hat{u}_j) = -h_{ij}\sigma^2$. Thus the least squares
residuals are heteroskedastic and correlated.
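These properties of the hat matrix can be verified on a small example. The sketch below is illustrative and rests on assumptions: the helper names are hypothetical, the design matrix (a constant and one regressor) is made up, and each column of $H$ is obtained as the vector of fitted values from regressing a unit vector on $X$, which avoids an explicit matrix inverse.

```python
def ols(X, y):
    """OLS via the normal equations; returns (coefficients, residual sum of squares)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):                      # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    rss = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return b, rss

def hat_matrix(X):
    """H = X (X'X)^{-1} X', built column by column from regressions on unit vectors."""
    n = len(X)
    cols = []
    for i in range(n):
        e = [1.0 if j == i else 0.0 for j in range(n)]
        b, _ = ols(X, e)
        cols.append([sum(bj * xj for bj, xj in zip(b, r)) for r in X])
    # cols[i][j] is element (j, i) of H
    return [[cols[i][j] for i in range(n)] for j in range(n)]

# Made-up design: a constant and one regressor, four observations.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 4.0], [1.0, 7.0]]
H = hat_matrix(X)
sigma2 = 1.0                                        # assumed error variance
var_resid = [(1 - H[i][i]) * sigma2 for i in range(len(X))]
cov_01 = -H[0][1] * sigma2                          # covariance of residuals 1 and 2
print(var_resid, cov_01)
```

The residual variances differ across observations and the covariance is nonzero, even though the true errors are assumed independent with a common variance. The diagonal elements of $H$ sum to the number of regressors.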


(2) The $\bar{R}^2$ Criterion

The $\bar{R}^2$ criterion or minimum $\hat{\sigma}^2$ criterion for the choice of models is based on
the following result: Suppose that

$y = X\beta + u$ is the true model. $X$ is $n \times k$.
$y = Z\delta + v$ is the misspecified model. $Z$ is $n \times r$.

We estimate the misspecified model. Consider the estimate of the residual
variance from this model. It is

$\hat{\sigma}^2 = \dfrac{1}{n - r}\, y'Ny$ where $N = I - Z(Z'Z)^{-1}Z'$

Since $y = X\beta + u$ and $E(uu') = I\sigma^2$, we have $y'Ny = (X\beta + u)'N(X\beta + u) =
\beta'X'NX\beta + u'Nu + 2\beta'X'Nu$. Since $E(u) = 0$ the last term has expectation
zero. Also, $E(u'Nu) = (n - r)\sigma^2$ since $N$ is idempotent of rank $(n - r)$. Hence
we get

$E(\hat{\sigma}^2) = \sigma^2 + \dfrac{1}{n - r}\, \beta'X'NX\beta$

Since the second term in the expression is $\geq 0$, we have

$E(\hat{\sigma}^2) \geq \sigma^2$

Thus the estimate of the error variance from the misspecified equation is
upward-biased. This is the basis of what is known as the “minimum $s^2$” or the
“maximum $\bar{R}^2$” rule. The rule says that if we are considering some alternative
regression models, we should choose the one with the minimum estimated error
variance. The idea behind it is that, on the average, the misspecified model
has a larger estimated error variance than the “true” model. Of course, the
suggested rule is based on the assumption that one of the models being
considered is the “true” model. It should be noted, however, that $E(\hat{\sigma}^2) = \sigma^2$ even
for a misspecified model if $X'N = 0$. This will be the case if $Z$ consists of
the variables $X$ and any number of irrelevant variables. This does not make
these models any better than models with a few omitted variables for which
$E(\hat{\sigma}^2) > \sigma^2$.

The preceding discussion shows that there are some drawbacks in choosing
between models on the basis of estimated error variance. Hence some
alternative criteria in terms of estimated mean-squared error of prediction have
been suggested. These criteria, called $C_p$, PC, $S_p$, have been discussed in
Section 12.7.
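The bias term in the result above can be computed directly, since $\beta'X'NX\beta$ is the residual sum of squares from regressing $X\beta$ on $Z$. The sketch below is an illustration on simulated regressors with hypothetical names (`ols`, `variance_bias`); the coefficient values and the choice of omitted and irrelevant variables are assumptions.

```python
import random

def ols(X, y):
    """OLS via the normal equations; returns (coefficients, residual sum of squares)."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):                      # Gauss-Jordan with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    rss = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return b, rss

def variance_bias(X, beta, Z):
    """beta'X'NX beta / (n - r), computed as the RSS from regressing X beta on Z."""
    xb = [sum(b * x for b, x in zip(beta, r)) for r in X]
    _, rss = ols(Z, xb)
    return rss / (len(Z) - len(Z[0]))

rng = random.Random(4)
n = 50
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]             # irrelevant variable
X = [list(r) for r in zip(x1, x2)]
beta = [1.0, 2.0]                                   # assumed true coefficients

bias_omitted = variance_bias(X, beta, [[a] for a in x1])                     # Z omits x2
bias_extra = variance_bias(X, beta, [list(r) for r in zip(x1, x2, w)])       # Z = X plus w
print(bias_omitted, bias_extra)
```

Omitting a relevant variable gives a strictly positive bias, while adding an irrelevant variable to the true regressors gives a bias that is zero up to rounding error.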
13.1 Introduction

A time series is a sequence of numerical data in which each item is associated
with a particular instant in time. One can quote numerous examples: monthly
unemployment, weekly measures of money supply, $M_1$ and $M_2$, daily closing
prices of stock indices, and so on. In fact with the current progress in computer
technology we have daily series on interest rates, the hourly “telerate” interest
rate index, and stock prices by the minute (or even second). A cartoon in the
New Yorker magazine shows a digital display at a bank:

Time: 11.23 

Temp: 83 

Interest rate: 8.62 

An analysis of a single sequence of data is called univariate time-series
analysis. An analysis of several sets of data for the same sequence of time periods
is called multivariate time-series analysis or, more simply, multiple time-series
analysis (e.g., an analysis, on the basis of monthly data, of the relationships among
unemployment, price level, money supply, etc., falls under multiple time-series
analysis). The purpose of time-series analysis is to study the dynamics or
temporal structure of the data.

For a long time there has been very little communication between
econometricians and time-series analysts. Econometricians have emphasized
economic theory and a study of contemporaneous relationships. Lagged variables
were introduced but not in any systematic way, and no serious attempts were
made to study the temporal structure of the data. Theories were imposed on
the data even when the temporal structure of the data was not in conformity
with the theories. The time-series analysts, on the other hand, did not believe
in economic theories and thought that they were better off allowing the data to
determine the model. Since the mid-1970s these two approaches (the
time-series approach and the econometric approach) have been converging.
Econometricians now use some of the basic elements of time-series analysis in
checking the specification of their econometric models, and some economic
theories have influenced the direction of time-series work.

13.2 Two Methods of Time-Series Analysis: 
Frequency Domain and Time Domain 


Time-series analysis can be roughly divided into two types of methods: 
frequency-domain methods and time-domain methods. In this chapter we discuss 
time-domain methods only. We shall, however, explain what these two methods 
are. 

In models underlying the frequency-domain analysis, the time series $X_t$ is 
expressed as the sum of independently varying cosine and sine curves with 
random amplitudes. We thus write $X_t$ as 

$$X_t = \mu + \sum_j \left[ Y_j \cos(2\pi f_j t) + Z_j \sin(2\pi f_j t) \right]$$

where the $Y$'s and $Z$'s are uncorrelated random variables with zero expectations 
and variances $\sigma^2(f)$, and the summation is over all frequencies. The frequencies 
$f_1, f_2, f_3, \ldots$ are equally spaced and separated by a small interval $\Delta f$. The 
purpose of the analysis is to see how the variance of $X_t$ is distributed among 
oscillations of various frequencies. The technique of analysis is called spectral 
analysis. 

Time-domain methods are based on direct modeling of the lagged relationships 
between a series and its past. The methods are similar to the type of 
analysis discussed in earlier chapters and involve fitting of linear 
autoregressions and cross-regressions. In any case it would be more difficult for 
beginning students to understand the spectral methods. Hence this introductory 
chapter on time series deals with time-domain methods only. 

13.3 Stationary and Nonstationary 
Time Series 


From a theoretical point of view a time series is a collection of random 
variables $\{X_t\}$. Such a collection of random variables ordered in time is called a 
stochastic process. The word stochastic has a Greek origin and means 
“pertaining to chance.” If $t$ is a continuous variable, it is customary to denote the 
random variables by $X(t)$, and if $t$ is a discrete variable, it is customary to 
denote them by $X_t$. An example of the continuous variable $X(t)$ is the recording 
of an electrocardiogram. Examples of discrete random variables $X_t$ are the data 
on unemployment, money supply, closing stock prices, and so on, that we 
mentioned earlier. We will not go into the theory of stochastic processes here in 
great detail. We will just outline those elements that are necessary for 
time-series analysis. Furthermore, we will be considering discrete processes only, 
and so we shall use the notation $X_t$ or $X(t)$ interchangeably. 

The random variables $\{X_t\}$ are, in general, not independent. Furthermore, we 
have just a sample of size 1 on each of the random variables (e.g., if we say 
that the unemployment rate at the end of this week is a random variable, we 
have just one observation on this particular random variable). There is no way 
of getting another observation, so we have what is called a single realization. 
These two features, dependence and lack of replication, compel us to specify 
some highly restrictive models for the statistical structure of the stochastic 
process. 

Strict Stationarity 

One way of describing a stochastic process is to specify the joint distribution 
of the variables $X_t$. This is quite complicated and not usually attempted in 
practice. Instead, what is usually done is that we define the first and second 
moments of the variables $X_t$. These are: 

1. The mean: $\mu(t) = E(X_t)$. 

2. The variance: $\sigma^2(t) = \text{var}(X_t)$. 

3. The autocovariances: $\gamma(t_1, t_2) = \text{cov}(X_{t_1}, X_{t_2})$. 

When $t_1 = t_2 = t$, the autocovariance is just $\sigma^2(t)$. One important class of 
stochastic processes is that of stationary stochastic processes. Corresponding 
to these we have the concept of stationary time series. A time series is said to 
be strictly stationary if the joint distribution of any set of $n$ observations 
$X(t_1), X(t_2), \ldots, X(t_n)$ is the same as the joint distribution of 
$X(t_1 + k), X(t_2 + k), \ldots, X(t_n + k)$ for all $n$ and $k$. 

The definition above of strict stationarity holds for all values of $n$. 
Substituting $n = 1$, we get $\mu(t) = \mu$, a constant, and $\sigma^2(t) = \sigma^2$, a constant, for all $t$. 
Furthermore, if we substitute $n = 2$, we get the result that the joint distribution 
of $X(t_1)$ and $X(t_2)$ is the same as that of $X(t_1 + k)$ and $X(t_2 + k)$. Writing 
$t_1 + k = t_2$, we see that this is the same as the distribution of $X(t_2)$ and $X(t_2 + k)$. 
Thus, it depends only on the difference $(t_2 - t_1)$, which is called the lag. 
Hence, we can write the autocovariance function $\gamma(t_1, t_2)$ as $\gamma(k)$, where 
$k = t_2 - t_1$ is the lag. Thus $\gamma(k) = \text{cov}[X(t), X(t + k)]$ is the autocovariance coefficient 
at lag $k$. $\gamma(k)$ is called the autocovariance function and will be abbreviated as 
acvf. $\gamma(0)$ is, of course, the variance $\sigma^2$. 

Since the autocovariance coefficients depend on the units of measurement 
of $X(t)$, it is convenient to consider the autocorrelations, which are free of the 
units of measurement. Since $\text{var}\, X(t) = \text{var}\, X(t + k) = \sigma^2 = \gamma(0)$, we have the 
autocorrelation coefficient $\rho(k)$ at lag $k$ as 

$$\rho(k) = \frac{\gamma(k)}{\gamma(0)}$$

$\rho(k)$ is called the autocorrelation function and will be abbreviated as acf. A plot 
of $\rho(k)$ against $k$ is called a correlogram. 
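
As a rough illustration of these definitions, the sketch below computes sample analogues of the acvf and acf in plain Python. The estimator used (deviations from the sample mean, divisor $n$) is a common convention assumed here, not one given in the text, and the function names and the short data list are made up for illustration.

```python
def sample_acvf(x, k):
    """Sample autocovariance at lag k (divisor n, an assumed convention)."""
    n = len(x)
    mean = sum(x) / n
    return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n


def sample_acf(x, max_lag):
    """Sample autocorrelations rho(0), ..., rho(max_lag) = gamma(k) / gamma(0)."""
    g0 = sample_acvf(x, 0)
    return [sample_acvf(x, k) / g0 for k in range(max_lag + 1)]


series = [2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0]  # made-up illustrative data
rho = sample_acf(series, 3)
```

Plotting the entries of `rho` against the lag would give the sample correlogram of the series.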


Weak Stationarity 

For a strictly stationary time series the distribution of $X(t)$ is independent of $t$. 
Thus it is not just the mean and variance that are constant. All higher-order 
moments are independent of $t$. So are all higher-order moments of the joint 
distribution of any combinations of the variables $X(t_1), X(t_2), X(t_3), \ldots$. In 
practice this is a very strong assumption, and it is useful to define stationarity 
in a less restrictive way. This definition is in terms of first and second moments 
only. 

A time series is said to be weakly stationary if its mean is constant and its 
acvf depends only on the lag, that is, 

$$E[X(t)] = \mu \quad \text{and} \quad \text{cov}[X(t), X(t + k)] = \gamma(k)$$

No assumptions are made about higher-order moments. Alternative terms 
used for this weakly stationary time series are wide-sense stationary, 
covariance stationary, or second-order stationary. 

If $X(t_1), X(t_2), \ldots, X(t_n)$ follow a multivariate normal distribution, then, since 
the multivariate normal distribution is completely characterized by the first 
and second moments, the two concepts of strict stationarity and weak 
stationarity are equivalent. For other distributions this is not so. In Figure 13.1 we 
show the graph of a stationary time series. It is the graph of $X_t$, where 
$X_t = 0.7X_{t-1} + e_t$ and $e_t \sim IN(5, 1)$. Because of the assumption of normality, this 
time series is strictly stationary. 


Properties of Autocorrelation Function 

There are two important points to note about the autocorrelation function (acf). 
These are the following: 

1. The acf is an even function of the lag $k$ [i.e., $\rho(k) = \rho(-k)$]. This follows 
from the result 

$$\gamma(k) = \text{cov}(X_t, X_{t+k}) = \text{cov}(X_{t-k}, X_t) = \gamma(-k)$$

where the second equality holds because of stationarity. 

2. For a given acf, there will be only one normal process. But it is possible 
to find several nonnormal processes that have the same acf. Jenkins and 
Watts¹ give an example of two different stochastic processes that have 
the same acf. 

¹ G. M. Jenkins and D. G. Watts, Spectral Analysis and Its Applications (San Francisco: 
Holden-Day, 1968), p. 170. 

Figure 13.2. Weekly average total transactions deposits component of $M_1$ for May 15, 
1974-January 6, 1982. Source: D. A. Pierce, M. R. Grupe, and W. P. Cleveland, 
“Seasonal Adjustment of the Weekly Monetary Aggregates,” Journal of Business and 
Economic Statistics, Vol. 2 (1984), p. 264. 


Nonstationarity 

In time-series analysis we do not confine ourselves to the analysis of stationary 
time series. In fact, most of the time series we encounter are nonstationary. A 
simple nonstationary time-series model is $X_t = \mu_t + e_t$, where the mean $\mu_t$ is 
a function of time and $e_t$ is a weakly stationary series. $\mu_t$, for example, could 
be a linear function of $t$ (a linear trend) or a quadratic function of $t$ (a parabolic 
trend). In Figure 13.2 we show the graph of a nonstationary time series. It is 
clear that apart from a linear trend, holiday season peaks dominate the 
movements in the series. 


13.4 Some Useful Models for Time Series 


In this section we discuss several different types of stochastic processes that 
are useful in modeling time series: (1) a purely random process, (2) a random 
walk, (3) a moving-average (MA) process, (4) an autoregressive (AR) process, 
(5) an autoregressive moving-average (ARMA) process, and (6) an 
autoregressive integrated moving-average (ARIMA) process. 

Purely Random Process 


This is a discrete process $\{X_t\}$ consisting of a sequence of mutually independent 
identically distributed random variables. It has a constant mean and a constant 
variance and the acvf is 

$$\gamma(k) = \text{cov}(X_t, X_{t+k}) = 0 \quad \text{for } k \neq 0$$

The acf is given by 

$$\rho(k) = \begin{cases} 1 & \text{for } k = 0 \\ 0 & \text{for } k \neq 0 \end{cases}$$

A purely random process is also called white noise. 


Random Walk 

This is a process often used to describe the behavior of stock prices (although 
there are some dissidents who disagree with this random walk theory). Suppose 
that $\{e_t\}$ is a purely random series with mean $\mu$ and variance $\sigma^2$. Then a process 
$\{X_t\}$ is said to be a random walk if 

$$X_t = X_{t-1} + e_t$$

Let us assume that $X_0$ is equal to zero. Then the process evolves as follows: 

$$X_1 = e_1$$

$$X_2 = X_1 + e_2 = e_1 + e_2 \quad \text{and so on}$$

We have by successive substitution 

$$X_t = \sum_{i=1}^{t} e_i$$

Hence $E(X_t) = t\mu$ and $\text{var}(X_t) = t\sigma^2$. Since the mean and variance change with 
$t$, the process is nonstationary, but its first difference is stationary. Referring 
to share prices, this says that the changes in a share price will be a purely 
random process. 
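
The moments $E(X_t) = t\mu$ and $\text{var}(X_t) = t\sigma^2$ can be checked informally by simulation. The sketch below is only an illustration under assumed values ($\mu = 0.1$, $\sigma = 1$, normal shocks, 2000 simulated paths); none of these choices come from the text.

```python
import random


def random_walk(shocks):
    """Cumulate shocks: X_t = X_{t-1} + e_t, starting from X_0 = 0."""
    path, level = [], 0.0
    for e in shocks:
        level += e
        path.append(level)
    return path


random.seed(0)
mu, sigma, t_max, n_paths = 0.1, 1.0, 50, 2000  # assumed illustrative values
finals = []
for _ in range(n_paths):
    shocks = [random.gauss(mu, sigma) for _ in range(t_max)]
    finals.append(random_walk(shocks)[-1])  # X_t at t = t_max

# Across paths, the mean should be near t*mu = 5 and the variance near t*sigma^2 = 50.
mean_final = sum(finals) / n_paths
var_final = sum((v - mean_final) ** 2 for v in finals) / (n_paths - 1)

# First differences of a walk return the shocks themselves.
example = random_walk([1.0, 2.0, 3.0])
first_diff = [example[t] - example[t - 1] for t in range(1, len(example))]
```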


Moving-Average Processes 

Suppose that $\{e_t\}$ is a purely random process with mean zero and variance $\sigma^2$. 
Then a process $\{X_t\}$ defined by 

$$X_t = \beta_0 e_t + \beta_1 e_{t-1} + \cdots + \beta_m e_{t-m}$$

is called a moving-average process of order $m$ and is denoted by MA($m$). Since 
the $e$'s are unobserved variables, we scale them so that $\beta_0 = 1$. Since $E(e_t) = 0$ 
for all $t$, we have $E(X_t) = 0$. Also, $\text{var}(X_t) = \left(\sum_{i=0}^{m} \beta_i^2\right)\sigma^2$, since the $e_t$ are 
independent with a common variance $\sigma^2$. Further, writing out the expressions 
for $X_t$ and $X_{t-k}$ in terms of the $e$'s and picking up the common terms (since the 
$e$'s are independent), we get 

$$\gamma(k) = \text{cov}(X_t, X_{t-k}) = \begin{cases} \sigma^2 \sum_{i=0}^{m-k} \beta_i \beta_{i+k} & \text{for } k = 0, 1, 2, \ldots, m \\ 0 & \text{for } k > m \end{cases}$$

Also, considering $\text{cov}(X_t, X_{t+k})$, we get the same expressions as for $\gamma(k)$. 
Hence $\gamma(-k) = \gamma(k)$. The acf can be obtained by dividing $\gamma(k)$ by $\text{var}(X_t)$. 
For the MA process, $\rho(k) = 0$ for $k > m$; that is, the autocorrelations are zero for lags 
greater than the order of the process. Since $\gamma(k)$ is independent of $t$, the MA($m$) process 
is weakly stationary. Note that no restrictions on the $\beta_i$ are needed to prove the 
stationarity of the MA process. 
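
The autocovariance formula for the MA($m$) process translates directly into a few lines of code. The following is a minimal sketch; the function name and the coefficient values are illustrative assumptions, not taken from the text.

```python
def ma_acvf(beta, sigma2, k):
    """gamma(k) = sigma^2 * sum_{i=0}^{m-k} beta_i * beta_{i+k}, and 0 for |k| > m.

    beta is the list [beta_0, beta_1, ..., beta_m] with beta_0 = 1.
    """
    m = len(beta) - 1
    k = abs(k)  # gamma(-k) = gamma(k)
    if k > m:
        return 0.0
    return sigma2 * sum(beta[i] * beta[i + k] for i in range(m - k + 1))


beta = [1.0, 0.5, 0.25]  # an assumed MA(2) with beta_1 = 0.5, beta_2 = 0.25
gamma0 = ma_acvf(beta, 1.0, 0)
acf = [ma_acvf(beta, 1.0, k) / gamma0 for k in range(4)]  # rho(3) is zero since 3 > m
```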

To facilitate our notation we shall use the lag operator $L$. It is defined by 
$L^j X_t = X_{t-j}$ for all $j$. Thus $LX_t = X_{t-1}$, $L^2 X_t = X_{t-2}$, $L^{-1} X_t = X_{t+1}$, and so on. 
With this notation the MA($m$) process can be written as (since $\beta_0 = 1$) 

$$X_t = (1 + \beta_1 L + \beta_2 L^2 + \cdots + \beta_m L^m) e_t = \beta(L) e_t \quad \text{(say)}$$

The polynomial in $L$ has $m$ roots and we can write 

$$X_t = (1 - \pi_1 L)(1 - \pi_2 L) \cdots (1 - \pi_m L) e_t$$

where $\pi_1, \pi_2, \ldots, \pi_m$ are the roots of the equation 

$$Y^m + \beta_1 Y^{m-1} + \cdots + \beta_m = 0$$

After estimating the model we can calculate the residuals from $e_t = [\beta(L)]^{-1} X_t$, 
provided that $[\beta(L)]^{-1}$ converges. This condition is called the invertibility 
condition. The condition for invertibility is that $|\pi_i| < 1$ for all $i$.² This implies that 
an invertible MA($m$) process can be written as an AR($\infty$) process. 

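For instance, for the MA(2) process 

$$X_t = (1 + \beta_1 L + \beta_2 L^2) e_t$$

$\pi_1$ and $\pi_2$ are the roots of the quadratic equation $Y^2 + \beta_1 Y + \beta_2 = 0$. 

The condition $|\pi_i| < 1$ gives 

$$\left| \frac{-\beta_1 \pm \sqrt{\beta_1^2 - 4\beta_2}}{2} \right| < 1$$

This gives the result that $\beta_1$ and $\beta_2$ must satisfy: 

$$\beta_1 + \beta_2 > -1$$

$$\beta_2 - \beta_1 > -1 \qquad (13.1)$$

$$|\beta_2| < 1$$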

² An alternative statement often found in books on time series is that the roots of the equation 
$1 + \beta_1 Z + \beta_2 Z^2 + \cdots + \beta_m Z^m = 0$ all lie outside the unit circle. 

[The last condition is derived from the fact that $\beta_2$ is the product of the 
roots. The first two conditions are derived from the fact that if $\beta_1^2 - 4\beta_2 > 0$, 
then $(\beta_1^2 - 4\beta_2) < (2 + \beta_1)^2$ and $(\beta_1^2 - 4\beta_2) < (2 - \beta_1)^2$.] 
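
The equivalence between the root condition and the three inequalities in (13.1) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the test points are arbitrary values chosen away from the boundary of the region.

```python
import cmath


def ma2_roots(b1, b2):
    """Roots pi_1, pi_2 of Y^2 + b1*Y + b2 = 0 (possibly complex)."""
    disc = cmath.sqrt(b1 * b1 - 4.0 * b2)
    return (-b1 + disc) / 2.0, (-b1 - disc) / 2.0


def invertible_by_roots(b1, b2):
    """Invertibility checked directly: both roots inside the unit circle."""
    return all(abs(r) < 1.0 for r in ma2_roots(b1, b2))


def invertible_by_conditions(b1, b2):
    """Invertibility checked through the three inequalities (13.1)."""
    return (b1 + b2 > -1.0) and (b2 - b1 > -1.0) and (abs(b2) < 1.0)


# Arbitrary illustrative points: the two checks should agree on each of them.
points = [(0.5, 0.2), (1.5, 0.4), (-0.3, -0.5), (0.0, 0.9)]
agreement = [invertible_by_roots(b1, b2) == invertible_by_conditions(b1, b2)
             for b1, b2 in points]
```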

Moving-average processes arise in econometrics mostly through trend 
elimination methods. One procedure often used for trend elimination is that of 
successive differencing of the time series $X_t$. If we have 

$$X_t = a_0 + a_1 t + a_2 t^2 + e_t$$

where $e_t$ is a purely random process, successive differencing of $X_t$ will eliminate 
the trend but the resulting series is a moving-average process that can show a 
cycle. Thus the trend-eliminated series can show a cycle even when there was 
none in the original series. This phenomenon of spurious cycles is known as 
the Slutsky effect.³ 
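
To make this concrete: differencing the quadratic-trend model twice leaves the constant $2a_2$ plus $e_t - 2e_{t-1} + e_{t-2}$, which is an MA(2) in the original errors. The sketch below checks both pieces; the trend coefficients are arbitrary assumed values, and the autocorrelations follow from the MA acvf formula with weights $(1, -2, 1)$.

```python
def difference(x):
    """First difference: x_t - x_{t-1}."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]


a0, a1, a2 = 3.0, 0.5, 0.25  # assumed illustrative trend coefficients
trend = [a0 + a1 * t + a2 * t * t for t in range(10)]
second_diff = difference(difference(trend))  # every entry equals 2 * a2

# The noise part after two differences has MA weights (1, -2, 1).
w = [1.0, -2.0, 1.0]
g0 = sum(v * v for v in w)                   # proportional to gamma(0)
rho1 = (w[0] * w[1] + w[1] * w[2]) / g0      # -4/6
rho2 = (w[0] * w[2]) / g0                    # 1/6
```

The nonzero `rho1` and `rho2` show that the differenced errors are serially correlated even though the original $e_t$ were purely random.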


Autoregressive (AR) Process 

Suppose again that $\{e_t\}$ is a purely random process with mean zero and variance 
$\sigma^2$. Then the process $\{X_t\}$ given by 

$$X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \cdots + \alpha_r X_{t-r} + e_t \qquad (13.2)$$

is called an autoregressive process of order $r$ and is denoted by AR($r$). Since 
the expression is like a multiple regression equation, it is called “regressive.” 
However, it is a regression of $X_t$ on its own past values. Hence it is 
autoregressive. 

In terms of the lag operator $L$, the AR process (13.2) can be written as 

$$X_t = (\alpha_1 L + \alpha_2 L^2 + \cdots + \alpha_r L^r) X_t + e_t$$

or 

$$(1 - \alpha_1 L - \alpha_2 L^2 - \cdots - \alpha_r L^r) X_t = e_t \qquad (13.3)$$

or 

$$X_t = \frac{1}{1 - \alpha_1 L - \alpha_2 L^2 - \cdots - \alpha_r L^r}\, e_t = \frac{1}{(1 - \pi_1 L)(1 - \pi_2 L) \cdots (1 - \pi_r L)}\, e_t$$

where $\pi_1, \pi_2, \ldots, \pi_r$ are the roots of the equation 

$$Y^r - \alpha_1 Y^{r-1} - \alpha_2 Y^{r-2} - \cdots - \alpha_r = 0$$

The condition that the expansion of (13.3) is valid and the variance of $X_t$ is 
finite is that $|\pi_i| < 1$ for all $i$. 

³ E. E. Slutsky, “The Summation of Random Causes as a Source of Cyclic Processes,” 
Econometrica, Vol. 5, 1937, pp. 105-146. 

To find the acvf, we could expand (13.2), but the expressions are messy. An 
alternative procedure is to assume that the process is stationary and see what 
the $\rho(k)$ are. To do this we multiply equation (13.2) throughout by $X_{t-k}$, take 
expectations of all the terms, and divide throughout by $\text{var}(X_t)$, which is 
assumed finite. This gives us 

$$\rho(k) = \alpha_1 \rho(k - 1) + \cdots + \alpha_r \rho(k - r)$$

Substituting $k = 1, 2, \ldots, r$ and noting that $\rho(k) = \rho(-k)$, we get equations to 
determine the $r$ parameters $\alpha_1, \alpha_2, \ldots, \alpha_r$. These equations are known as the 
Yule-Walker equations. 
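
For $r = 2$ the Yule-Walker equations for $k = 1, 2$ are $\rho(1) = \alpha_1 + \alpha_2\rho(1)$ and $\rho(2) = \alpha_1\rho(1) + \alpha_2$, using $\rho(0) = 1$ and $\rho(-1) = \rho(1)$. A minimal sketch of solving this pair for $(\alpha_1, \alpha_2)$ follows; the function name is hypothetical and the input autocorrelations are assumed values chosen for illustration (in applied work sample autocorrelations would take their place).

```python
def yule_walker_ar2(rho1, rho2):
    """Solve rho(1) = a1 + a2*rho(1) and rho(2) = a1*rho(1) + a2 for (a1, a2)."""
    det = 1.0 - rho1 * rho1          # determinant of [[1, rho1], [rho1, 1]]
    a1 = rho1 * (1.0 - rho2) / det
    a2 = (rho2 - rho1 * rho1) / det
    return a1, a2


# Assumed autocorrelations rho(1) = 2/3 and rho(2) = 1/6.
a1, a2 = yule_walker_ar2(2.0 / 3.0, 1.0 / 6.0)
```

With these inputs the solution is $\alpha_1 = 1$ and $\alpha_2 = -0.5$.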

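To illustrate these procedures we will consider an AR(2) process 

$$X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + e_t$$

$\pi_1$ and $\pi_2$ are the roots of the equation 

$$Y^2 - \alpha_1 Y - \alpha_2 = 0$$

Thus $|\pi_i| < 1$ implies that 

$$\left| \frac{\alpha_1 \pm \sqrt{\alpha_1^2 + 4\alpha_2}}{2} \right| < 1$$

This gives 

$$\alpha_1 + \alpha_2 < 1$$

$$\alpha_1 - \alpha_2 > -1 \qquad (13.4)$$

$$|\alpha_2| < 1$$

[The conditions are similar to the conditions (13.1) derived for the invertibility 
of the MA(2) process, with $-\alpha_1$ and $-\alpha_2$ in place of $\beta_1$ and $\beta_2$.] 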

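In the case of the AR(2) process we can also obtain the $\rho(k)$ recursively using 
the Yule-Walker equations. We know that 

$$\rho(0) = 1 \quad \text{and} \quad \rho(1) = \alpha_1 \rho(0) + \alpha_2 \rho(-1)$$

Thus 

$$\rho(1) = \alpha_1 \rho(0) + \alpha_2 \rho(1) \quad \text{or} \quad \rho(1) = \frac{\alpha_1}{1 - \alpha_2}$$

$$\rho(2) = \alpha_1 \rho(1) + \alpha_2 \rho(0) = \frac{\alpha_1^2}{1 - \alpha_2} + \alpha_2$$

$$\rho(3) = \alpha_1 \rho(2) + \alpha_2 \rho(1) = \frac{\alpha_1(\alpha_1^2 + \alpha_2)}{1 - \alpha_2} + \alpha_1 \alpha_2$$

and so on. 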

As an example, consider the AR(2) process 

$$X_t = 1.0X_{t-1} - 0.5X_{t-2} + e_t$$

Here $\alpha_1 = 1.0$ and $\alpha_2 = -0.5$. Note that conditions (13.4) for weak stationarity 
are satisfied. However, since $\alpha_1^2 + 4\alpha_2 < 0$ the roots are complex and $\rho(k)$ will 
be a sinusoidal function. A convenient method to derive $\rho(k)$ is to use the 
recurrence relation (also known as the Yule-Walker relation) 

$$\rho(k) = \rho(k - 1) - 0.5\rho(k - 2)$$

noting that $\rho(0) = 1$ and $\rho(1) = \alpha_1/(1 - \alpha_2) = 0.6666$. We then have 

$\rho(2) = 0.1666$ 

$\rho(3) = -0.1666$ 

$\rho(4) = -0.25$ 

$\rho(5) = -0.1666$ 

$\rho(6) = -0.0416$ 

$\rho(7) = 0.0416$ 

$\rho(8) = 0.0625$ 

$\rho(9) = 0.0416$ 

$\rho(10) = 0.0104$ 

$\rho(11) = -0.0104$ 

$\rho(12) = -0.0156$ 

$\rho(13) = -0.0104$ 

This method can be used whether the roots are real or complex. 


A plot of this correlogram is left as an exercise. 
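
The recursion used in this example is easy to mechanize. The sketch below (function name hypothetical) starts from $\rho(0) = 1$ and $\rho(1) = \alpha_1/(1 - \alpha_2)$ and applies $\rho(k) = \alpha_1\rho(k-1) + \alpha_2\rho(k-2)$; it is offered as a way to generate the values for the correlogram, not as part of the original exposition.

```python
def ar2_acf(a1, a2, max_lag):
    """Autocorrelations rho(0..max_lag) of a stationary AR(2) via the Yule-Walker recursion."""
    rho = [1.0, a1 / (1.0 - a2)]
    for k in range(2, max_lag + 1):
        rho.append(a1 * rho[k - 1] + a2 * rho[k - 2])
    return rho[: max_lag + 1]


rho = ar2_acf(1.0, -0.5, 13)  # the AR(2) example with alpha_1 = 1.0, alpha_2 = -0.5
for k, value in enumerate(rho):
    print(k, round(value, 4))
```

The sign of $\rho(k)$ changes every few lags while its magnitude shrinks, which is the damped oscillation expected when the roots are complex.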

Autoregressive Moving-Average (ARMA) Processes 

We will now discuss models that are combinations of the AR and MA models. 
These are called ARMA models. An ARMA($p$, $q$) model is defined as 

$$X_t = \alpha_1 X_{t-1} + \cdots + \alpha_p X_{t-p} + e_t + \beta_1 e_{t-1} + \cdots + \beta_q e_{t-q}$$

where $\{e_t\}$ is a purely random process with mean zero and variance $\sigma^2$. The 
motivation for these methods is that they lead to parsimonious representations 
of higher-order AR($p$) or MA($q$) processes. 

Using the lag operator $L$, we can write this as 

$$\phi(L) X_t = \theta(L) e_t$$

where $\phi(L)$ and $\theta(L)$ are polynomials of orders $p$ and $q$, respectively, defined as 

$$\phi(L) = 1 - \alpha_1 L - \alpha_2 L^2 - \cdots - \alpha_p L^p$$

$$\theta(L) = 1 + \beta_1 L + \beta_2 L^2 + \cdots + \beta_q L^q$$

For stationarity we require that the roots of $\phi(L) = 0$ lie outside the unit 
circle. For invertibility of the MA component, we require that the roots of $\theta(L) = 0$ 
lie outside the unit circle. For instance, for the ARMA(2, 2) process these 
conditions are given by equations (13.1) and (13.4). The acvf and acf of an ARMA 
model are more complicated than for an AR or MA model. 

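We will derive the acf for the simplest case: the ARMA(1, 1) process 

$$X_t = \alpha_1 X_{t-1} + e_t + \beta_1 e_{t-1}$$

In terms of the lag operator $L$ this can be written as 

$$X_t - \alpha_1 X_{t-1} = e_t + \beta_1 e_{t-1}$$

or 

$$(1 - \alpha_1 L) X_t = (1 + \beta_1 L) e_t$$

or, dropping the subscripts on $\alpha_1$ and $\beta_1$, 

$$X_t = \frac{1 + \beta L}{1 - \alpha L}\, e_t = (1 + \beta L)(1 + \alpha L + \alpha^2 L^2 + \cdots) e_t$$

$$= [1 + (\alpha + \beta) L + \alpha(\alpha + \beta) L^2 + \alpha^2(\alpha + \beta) L^3 + \cdots] e_t$$

Since $e_t$ is a purely random process with variance $\sigma^2$ we get 

$$\text{var}(X_t) = [1 + (\alpha + \beta)^2 + \alpha^2(\alpha + \beta)^2 + \cdots]\sigma^2 = \left(1 + \frac{(\alpha + \beta)^2}{1 - \alpha^2}\right)\sigma^2 = \frac{1 + \beta^2 + 2\alpha\beta}{1 - \alpha^2}\,\sigma^2$$

Also 

$$\text{cov}(X_t, X_{t-1}) = [(\alpha + \beta) + \alpha(\alpha + \beta)^2 + \alpha^3(\alpha + \beta)^2 + \cdots]\sigma^2 = \left(\alpha + \beta + \frac{(\alpha + \beta)^2 \alpha}{1 - \alpha^2}\right)\sigma^2 = \frac{(\alpha + \beta)(1 + \alpha\beta)}{1 - \alpha^2}\,\sigma^2$$

Hence 

$$\rho(1) = \frac{\text{cov}(X_t, X_{t-1})}{\text{var}(X_t)} = \frac{(\alpha + \beta)(1 + \alpha\beta)}{1 + \beta^2 + 2\alpha\beta}$$

Successive values of $\rho(k)$ can be obtained from the recurrence relation 
$\rho(k) = \alpha\rho(k - 1)$ for $k \geq 2$. For the AR(1) process, $\rho(1) = \alpha$. It can be verified that 
$\rho(1)$ for the ARMA(1, 1) process is $> \alpha$ or $< \alpha$ depending on whether $\beta > 0$ or $\beta < 0$, 
respectively. 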
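
The closed form for $\rho(1)$ and the recurrence for higher lags can be put together in a short function. This is a sketch with a hypothetical function name; the parameter values in the usage lines are arbitrary, with $\alpha > 0$ so that the comparison with the AR(1) value $\alpha$ is meaningful.

```python
def arma11_acf(alpha, beta, max_lag):
    """Autocorrelations rho(0..max_lag) of a stationary ARMA(1, 1) process."""
    rho1 = (alpha + beta) * (1.0 + alpha * beta) / (1.0 + beta * beta + 2.0 * alpha * beta)
    rho = [1.0, rho1]
    for k in range(2, max_lag + 1):
        rho.append(alpha * rho[k - 1])   # rho(k) = alpha * rho(k - 1) for k >= 2
    return rho[: max_lag + 1]


rho = arma11_acf(0.5, 0.4, 3)             # beta > 0: rho(1) exceeds alpha
rho_negative = arma11_acf(0.5, -0.2, 3)   # beta < 0: rho(1) is below alpha
rho_ar1 = arma11_acf(0.5, 0.0, 3)         # beta = 0 reduces to the AR(1) case
```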


Autoregressive Integrated Moving-Average 
(ARIMA) Processes 

In practice, most time series are nonstationary. One procedure that is often 
used to convert a nonstationary series to a stationary series is successive 
differencing. Let us define the operator $\Delta = 1 - L$, so that $\Delta X_t = X_t - X_{t-1}$, 

$$\Delta^2 X_t = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) \quad \text{and so on}$$

Suppose that $\Delta^d X_t$ is a stationary series that can be represented by an 
ARMA($p$, $q$) model. Then we say that $X_t$ can be represented by an ARIMA($p$, $d$, $q$) 
model. The model is called an integrated model because the stationary ARMA 
model that is fitted to the differenced data has to be summed or “integrated” 
to provide a model for the nonstationary data. Actually, even if there is no need 
for a moving-average component in modeling $X_t$, the procedure of differencing 
$X_t$ will produce a moving-average process (the Slutsky effect mentioned in our 
discussion of the MA process). 

13.5 Estimation of AR, MA, and 
ARMA Models 


The estimation of AR models is straightforward. We estimate them by ordinary 
least squares by minimizing $\sum e_t^2$. The only problem is that of the choice of the 
degree of autoregression. This is discussed in Section 13.7. There is also a loss 
in the number of observations used as the lag length increases. This is not a 
problem if we have a long time series. 


Estimation of MA Models 

We shall illustrate the estimation of MA models with a simple second-order 
MA process. Consider the MA(2) model 

$$X_t = \mu + e_t + \beta_1 e_{t-1} + \beta_2 e_{t-2}$$

In the case of MA models, we cannot write the error sum of squares $\sum e_t^2$ as 
simply a function of the observed $x$'s and the parameters as in the AR models. 
What we can do is to write down the covariance matrix of the moving-average 
error and, assuming normality, use the maximum likelihood method of 
estimation.⁴ An alternative procedure suggested by Box and Jenkins⁵ is the 
grid-search procedure. In this procedure we compute $e_t$ by successive substitution 
for each value of $(\beta_1, \beta_2)$ given some initial values, say $\mu = \bar{x}$ and 
$e_0 = e_{-1} = 0$. We then have, for the MA(2) model, 

$$e_1 = x_1 - \mu$$

$$e_2 = x_2 - \mu - \beta_1 e_1$$

$$e_t = x_t - \mu - \beta_1 e_{t-1} - \beta_2 e_{t-2} \quad \text{for } t \geq 3$$

Thus successive values of $e_t$ can be generated and $\sum e_t^2$ can be computed for 
each set of values of $(\beta_1, \beta_2)$. This grid search is conducted over the admissible 
range of values for $(\beta_1, \beta_2)$ given by equations (13.1), and the set of values 
$(\beta_1, \beta_2)$ that minimizes $\sum e_t^2$ is chosen. This grid-search procedure is, of course, 
not very practicable if we have many parameters in the MA process. However, 
in practice one usually uses a low-order MA process or a low-order MA 
component in an ARMA process. 
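
The grid-search steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Box-Jenkins program itself: the grid step of 0.05, the simulated data, and the "true" values $\beta_1 = 0.5$, $\beta_2 = 0.3$ are all made up for the example, and the function names are hypothetical.

```python
import random


def ma2_sum_sq(x, b1, b2):
    """Sum of e_t^2 where e_t = x_t - mu - b1*e_{t-1} - b2*e_{t-2},
    with mu set to the sample mean and e_0 = e_{-1} = 0."""
    mu = sum(x) / len(x)
    e1 = e2 = 0.0          # e_{t-1} and e_{t-2}
    total = 0.0
    for value in x:
        e = value - mu - b1 * e1 - b2 * e2
        total += e * e
        e1, e2 = e, e1
    return total


def admissible(b1, b2):
    """Invertibility region (13.1)."""
    return b1 + b2 > -1.0 and b2 - b1 > -1.0 and abs(b2) < 1.0


def ma2_grid_search(x, step=0.05):
    """Return (b1, b2, sum of squares) minimizing the sum of squares over the grid."""
    best = None
    n1, n2 = int(round(2.0 / step)), int(round(1.0 / step))
    for i in range(-n1, n1 + 1):
        for j in range(-n2, n2 + 1):
            b1, b2 = i * step, j * step
            if not admissible(b1, b2):
                continue
            ss = ma2_sum_sq(x, b1, b2)
            if best is None or ss < best[2]:
                best = (b1, b2, ss)
    return best


# Simulated MA(2) data with assumed values mu = 10, beta_1 = 0.5, beta_2 = 0.3.
random.seed(1)
shocks = [random.gauss(0.0, 1.0) for _ in range(402)]
x = [10.0 + shocks[t] + 0.5 * shocks[t - 1] + 0.3 * shocks[t - 2] for t in range(2, 402)]
b1_hat, b2_hat, ss_min = ma2_grid_search(x)
```

A finer grid around the first minimum would sharpen the estimates; the cost grows quickly with the number of MA parameters, which is the practical limitation noted in the text.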


⁴ For this see D. R. Osborne, “Maximum Likelihood Estimation of Moving Average Processes,” 
Annals of Economic and Social Measurement, Vol. 5, 1976, pp. 75-87, and J. E. H. Davidson, 
“Problems with the Estimation of Moving Average Processes,” Journal of Econometrics, Vol. 
19, 1981, pp. 295-310. 

⁵ G. E. P. Box and G. M. Jenkins, Time Series Analysis, Forecasting and Control, rev. ed. (San 
Francisco: Holden-Day, 1976), Chap. 7. 

Estimation of ARMA Models 

We can now consider the estimation of ARMA models. Again, the problem is 
with the MA component. Either we have to write down the covariance matrix 
for the errors in the MA component and use ML methods or use the grid-search 
procedure for the MA component. We shall discuss the latter procedure. 
Consider an ARMA(2, 2) model 

$$X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + e_t + \beta_1 e_{t-1} + \beta_2 e_{t-2}$$

This can be written as 

$$(1 - \alpha_1 L - \alpha_2 L^2) X_t = e_t + \beta_1 e_{t-1} + \beta_2 e_{t-2}$$

or 

$$X_t = \frac{1}{1 - \alpha_1 L - \alpha_2 L^2}\,(e_t + \beta_1 e_{t-1} + \beta_2 e_{t-2}) \qquad (13.5)$$

Let 

$$Z_t = \frac{1}{1 - \alpha_1 L - \alpha_2 L^2}\, e_t$$

Multiplying both sides by $(1 - \alpha_1 L - \alpha_2 L^2)$ we get 

$$Z_t - \alpha_1 Z_{t-1} - \alpha_2 Z_{t-2} = e_t \qquad (13.6)$$

Also, from (13.5) we have 

$$X_t = Z_t + \beta_1 Z_{t-1} + \beta_2 Z_{t-2}$$

or 

$$Z_t = X_t - \beta_1 Z_{t-1} - \beta_2 Z_{t-2}$$

The grid-search procedure is as follows: Starting with $Z_0 = Z_{-1} = 0$, we 
generate successive values of $Z_t$ for different sets of values for $(\beta_1, \beta_2)$ in the region 
given by equations (13.1) as follows: 

$$Z_1 = X_1$$

$$Z_2 = X_2 - \beta_1 Z_1$$

$$Z_t = X_t - \beta_1 Z_{t-1} - \beta_2 Z_{t-2} \quad \text{for } t \geq 3$$

We then use the generated $Z_t$ to estimate the parameters $(\alpha_1, \alpha_2)$ in (13.6) by 
ordinary least squares. We choose those values of $(\beta_1, \beta_2)$ that minimize $\sum \hat{e}_t^2$. 
The corresponding values $\hat{\alpha}_1$ and $\hat{\alpha}_2$ give the estimates of $\alpha_1$ and $\alpha_2$. 
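
A rough sketch of this two-stage procedure is given below. Several details are assumptions made for the illustration rather than prescriptions from the text: the grid step of 0.1, the simulated data and its parameter values, the decision to drop the first two observations in the OLS step instead of using the zero start-up values as regressors, and all function names.

```python
import random


def filter_z(x, b1, b2):
    """Generate Z_t = X_t - b1*Z_{t-1} - b2*Z_{t-2}, starting from Z_0 = Z_{-1} = 0."""
    z, z1, z2 = [], 0.0, 0.0
    for value in x:
        current = value - b1 * z1 - b2 * z2
        z.append(current)
        z1, z2 = current, z1
    return z


def ols_ar2(z):
    """OLS of Z_t on Z_{t-1} and Z_{t-2} without intercept; returns (a1, a2, rss)."""
    y, u, v = z[2:], z[1:-1], z[:-2]
    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suy = sum(a * c for a, c in zip(u, y))
    svy = sum(b * c for b, c in zip(v, y))
    det = suu * svv - suv * suv
    a1 = (suy * svv - svy * suv) / det
    a2 = (svy * suu - suy * suv) / det
    rss = sum((c - a1 * a - a2 * b) ** 2 for a, b, c in zip(u, v, y))
    return a1, a2, rss


def arma22_grid_search(x, step=0.1):
    """Search (b1, b2) over region (13.1); for each pair estimate (a1, a2) by OLS."""
    best = None
    n1, n2 = int(round(2.0 / step)), int(round(1.0 / step))
    for i in range(-n1, n1 + 1):
        for j in range(-n2, n2 + 1):
            b1, b2 = i * step, j * step
            if not (b1 + b2 > -1.0 and b2 - b1 > -1.0 and abs(b2) < 1.0):
                continue
            a1, a2, rss = ols_ar2(filter_z(x, b1, b2))
            if best is None or rss < best[4]:
                best = (a1, a2, b1, b2, rss)
    return best


# Simulated ARMA(2, 2) data with assumed parameters (0.6, -0.3, 0.4, 0.2).
random.seed(2)
n = 300
e = [random.gauss(0.0, 1.0) for _ in range(n + 2)]
x = [0.0, 0.0]
for t in range(2, n + 2):
    x.append(0.6 * x[t - 1] - 0.3 * x[t - 2] + e[t] + 0.4 * e[t - 1] + 0.2 * e[t - 2])
x = x[2:]
a1_hat, a2_hat, b1_hat, b2_hat, rss_min = arma22_grid_search(x)
```

With a short series the four parameters may be estimated imprecisely, so the output of such a coarse search is best treated as a starting point.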

For ARIMA models the procedure described above is used after successively 
differencing the given series until it is stationary. We have discussed the 
grid-search procedure here. Given the current high-speed computers, it is possible 
to compute the exact maximum likelihood estimates as well. An algorithm for 
this is described in Ansley.⁶ 

Residuals from the ARMA Models 

After obtaining estimates of the parameters $(\alpha_1, \alpha_2, \beta_1, \beta_2)$, we get the predicted 
residuals from equation (13.6). We have 

$$\hat{e}_t = Z_t - \hat{\alpha}_1 Z_{t-1} - \hat{\alpha}_2 Z_{t-2}$$

Note that the $Z_t$ are obtained from the $X$'s and the final estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ of $\beta_1$ 
and $\beta_2$, respectively. These residuals are useful for forecasting from ARMA 
models. This is discussed in the next section. 

An alternative way of obtaining the residuals is to solve the ARMA(2, 2) 
model by expanding the expressions in terms of the lag operator $L$. Note that 
the model $(1 - \alpha_1 L - \alpha_2 L^2) x_t = (1 + \beta_1 L + \beta_2 L^2) e_t$ gives 

$$e_t = (1 + \beta_1 L + \beta_2 L^2)^{-1} (1 - \alpha_1 L - \alpha_2 L^2) x_t$$

Since this is a power series in $L$, we should write it as 

$$(1 + \gamma_1 L + \gamma_2 L^2 + \gamma_3 L^3 + \gamma_4 L^4 + \cdots) x_t$$

We have to find the $\gamma_j$. This can be done by noting that 
$(1 + \beta_1 L + \beta_2 L^2)(1 + \gamma_1 L + \gamma_2 L^2 + \gamma_3 L^3 + \cdots) = (1 - \alpha_1 L - \alpha_2 L^2)$ 
and equating the coefficients of like powers of $L$. This gives 

coefficient of $L$: $\beta_1 + \gamma_1 = -\alpha_1$ or $\gamma_1 = -(\alpha_1 + \beta_1)$ 

coefficient of $L^2$: $\beta_2 + \beta_1 \gamma_1 + \gamma_2 = -\alpha_2$ or $\gamma_2 = -(\alpha_2 + \beta_2) + \beta_1(\alpha_1 + \beta_1)$ 

coefficient of $L^3$: $\beta_1 \gamma_2 + \beta_2 \gamma_1 + \gamma_3 = 0$ or $\gamma_3 = -(\beta_1 \gamma_2 + \beta_2 \gamma_1)$ 

The rest of the $\gamma$'s can be obtained recursively from the relation 

coefficient of $L^{j+1}$: $\gamma_{j+1} = -(\beta_1 \gamma_j + \beta_2 \gamma_{j-1})$ 

Once we get the $\gamma_j$ we can write $e_t = x_t + \gamma_1 x_{t-1} + \gamma_2 x_{t-2} + \cdots$ and get the 
estimated residual $\hat{e}_t$ by substituting $\hat{\gamma}$ for $\gamma$. We shall not provide examples 
here; they are provided in the exercises at the end of the chapter. 
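
The recursion for the $\gamma_j$ can be sketched in a few lines. The function name and the parameter values below are illustrative assumptions; the check at the end simply confirms that the computed weights reproduce the coefficient-matching identities stated above.

```python
def ar_inf_weights(a1, a2, b1, b2, n_terms):
    """Weights gamma_0, ..., gamma_{n_terms-1} in e_t = sum_j gamma_j * x_{t-j}
    for the ARMA(2, 2) model, with gamma_0 = 1."""
    g = [1.0, -(a1 + b1), -(a2 + b2) + b1 * (a1 + b1)]
    while len(g) < n_terms:
        g.append(-(b1 * g[-1] + b2 * g[-2]))   # gamma_{j+1} = -(b1*gamma_j + b2*gamma_{j-1})
    return g[:n_terms]


a1, a2, b1, b2 = 0.5, 0.2, 0.4, 0.1   # assumed illustrative parameter values
g = ar_inf_weights(a1, a2, b1, b2, 6)

# Coefficients of (1 + b1*L + b2*L^2) * gamma(L); these should be 1, -a1, -a2, 0, 0, ...
product = [g[0], g[1] + b1 * g[0]]
product += [g[k] + b1 * g[k - 1] + b2 * g[k - 2] for k in range(2, len(g))]
```

Because the MA part is invertible for these values, the weights die out as the lag increases, so truncating the sum after a moderate number of terms loses little.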

Testing Goodness of Fit 

⁶ C. F. Ansley, “An Algorithm for the Exact Likelihood of Mixed Autoregressive Moving 
Average Process,” Biometrika, Vol. 66, 1979, pp. 59-65. 

When an AR, MA, or ARMA model has been fitted to a given time series, it is 
advisable to check that the model does really give an adequate description of 
the data. There are two criteria often used that reflect the closeness of fit and 
the number of parameters estimated. One is the Akaike information criterion 
(AIC), and the other is the Schwarz Bayesian criterion (SBC). The latter is also 
called the Bayesian information criterion (BIC). If $p$ is the total number of 
parameters estimated, we have 

$$\text{AIC}(p) = n \log \hat{\sigma}_p^2 + 2p$$

$$\text{BIC}(p) = n \log \hat{\sigma}_p^2 + p \log n$$

Here $n$ is the sample size. If RSS is the residual sum of squares, $\sum \hat{e}_t^2$, then 
$\hat{\sigma}_p^2 = \text{RSS}/(n - p)$. If we are considering several ARMA models, we choose 
the one with the lowest AIC or BIC. (The two criteria can lead to different 
conclusions.) These goodness-of-fit criteria are more like the $\bar{R}^2$ or minimum 
$\hat{\sigma}^2$-type criterion. In addition, we have to check the serial correlation pattern 
of the residuals; that is, we need to be sure that there is no serial correlation. 
One can look at the first-order autocorrelation among the residuals. However, 
as discussed in Chapter 6, one cannot use the Durbin-Watson statistic. With 
autoregressive models, we have to use Durbin's $h$-test, or the LM test 
discussed in Section 6.8. 
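
As a small illustration of how the two criteria are computed and compared, consider the sketch below. The residual sums of squares and parameter counts are hypothetical numbers invented for the example, natural logarithms are assumed, and the function name is not from the text.

```python
import math


def aic_bic(rss, n, p):
    """Return (AIC, BIC) using sigma_hat^2 = RSS / (n - p) and natural logs."""
    sigma2 = rss / (n - p)
    aic = n * math.log(sigma2) + 2 * p
    bic = n * math.log(sigma2) + p * math.log(n)
    return aic, bic


# Hypothetical (RSS, number of parameters) for two models fitted to n = 100 observations.
candidates = {"ARMA(1,0)": (120.0, 1), "ARMA(2,1)": (118.0, 3)}
scores = {name: aic_bic(rss, 100, p) for name, (rss, p) in candidates.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

Since $\log n > 2$ once $n$ exceeds about 7, BIC charges more per parameter than AIC in samples of any realistic size, which is one reason the two criteria can disagree.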

Box and Pierce⁷ suggest looking at not just the first-order autocorrelation but 
autocorrelations of all orders of the residuals. They suggest calculating 
$Q = N \sum_{k=1}^{m} r_k^2$, where $r_k$ is the autocorrelation of lag $k$ and $N$ is the number of 
observations in the series. If the model fitted is appropriate, they argue that $Q$ 
has an asymptotic $\chi^2$ distribution with $m - p - q$ degrees of freedom, where $p$ 
and $q$ are, respectively, the orders of the AR and MA components. 

Actually, though the $Q$-statistics are quite widely used by those using 
time-series programs (there is no need to list here the hundreds of papers, books, and 
programs that still use them), they are not appropriate in autoregressive models 
(or models with lagged dependent variables). The arguments against their use 
are exactly the same as those against the use of the DW statistic, as discussed 
in Chapter 6. One can use Durbin's $h$-test, but that tests for only first-order 
autocorrelation. The $Q$-statistics are designed to test correlations of higher 
orders as well. For this purpose it is appropriate to use the LM test as suggested 
in Section 6.8 of Chapter 6. 

⁷ G. E. P. Box and D. A. Pierce, “Distribution of Residual Autocorrelations in Autoregressive 
Integrated Moving Average Time Series Models,” Journal of the American Statistical 
Association, Vol. 65, 1970, pp. 1509-1526. 

⁸ C. Chatfield and D. L. Prothero, “Box-Jenkins Seasonal Forecasting: Problems in a Case 
Study,” Journal of the Royal Statistical Society, Series A, Vol. 136, 1973, pp. 295-336. 

The discussion in the time-series literature does not pay any attention to this 
aspect of the inappropriateness of the $Q$-statistics. The Box-Pierce paper 
appeared in 1970 and Durbin's paper, which showed the inappropriateness of 
using the DW test with lagged dependent variables (autoregressive models) and 
suggested an alternative, also appeared in the same year. In spite of the fact 
that the discussion of the $Q$-statistics in the time-series literature was all in the 
1970s after Durbin's paper appeared, it all concentrated on the “low power” 
of the $Q$-statistics. For instance, Chatfield and Prothero⁸ fitted four different 
ARIMA models to the same set of data, but all four models gave a 
nonsignificant value of $Q$. Ljung and Box⁹ suggest a modification of the $Q$-statistic for 
moderate sample sizes. They suggest 

$$Q^* = N(N + 2) \sum_{k=1}^{m} (N - k)^{-1} r_k^2$$

instead of the $Q$-statistic. However, even this statistic was found to have low power by 
Davies and Newbold.¹⁰ This too is inappropriate, just the way $Q$ is, in 
autoregressive models. 
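
Purely to show the arithmetic of the two statistics (and keeping in mind the caution above about their use with autoregressive models), here is a minimal sketch. The residual series is made up, the autocorrelation estimator is a common convention assumed for the example, and the function names are hypothetical.

```python
def autocorr(e, k):
    """Lag-k autocorrelation of the residuals, in deviations from their mean."""
    n = len(e)
    mean = sum(e) / n
    d = [v - mean for v in e]
    return sum(d[t] * d[t + k] for t in range(n - k)) / sum(v * v for v in d)


def box_pierce_q(e, m):
    """Q = N * sum_{k=1}^{m} r_k^2."""
    return len(e) * sum(autocorr(e, k) ** 2 for k in range(1, m + 1))


def ljung_box_q(e, m):
    """Q* = N * (N + 2) * sum_{k=1}^{m} r_k^2 / (N - k)."""
    n = len(e)
    return n * (n + 2) * sum(autocorr(e, k) ** 2 / (n - k) for k in range(1, m + 1))


residuals = [1.0, -1.0] * 5           # made-up residuals; r_1 = -0.9 here
q = box_pierce_q(residuals, 1)        # 10 * 0.81
q_star = ljung_box_q(residuals, 1)    # 10 * 12 * 0.81 / 9
```

Since $(N + 2)/(N - k) > 1$ for every lag, $Q^*$ is never smaller than $Q$ on the same residuals.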

Thus one has to replace the $Q$-test with some other tests. One alternative 
that has been suggested is to use Lagrange multiplier (LM) tests.¹¹ We 
discussed this earlier in Section 6.8 of Chapter 6. They are, as shown there, very 
simple to implement. Godfrey¹² discusses a different type of LM test. This is 
to check the adequacy of the degree of autoregression in the ARMA model. 

Suppose that we estimate an ARMA($p$, $q$) model and ask whether we should 
be estimating the extended ARMA($p + m$, $q$) model. As before, $\alpha_i$ represent 
the parameters in the AR part and $\beta_j$ represent the parameters in the MA part. 

Denote by $\hat{e}_t$ the computed residuals from the ARMA($p$, $q$) model. Now 
consider the extended ARMA($p + m$, $q$) model and compute 

$$v_{it} = \frac{\partial e_t}{\partial \alpha_i}, \quad i = 1, 2, \ldots, p + m \qquad \text{and} \qquad w_{jt} = \frac{\partial e_t}{\partial \beta_j}, \quad j = 1, 2, \ldots, q$$

Let $\hat{v}_{it}$ and $\hat{w}_{jt}$ be the values of $v_{it}$ and $w_{jt}$, respectively, evaluated at the ML 
estimates of the restricted ARMA($p$, $q$) model (i.e., setting $\alpha_i = 0$ for 
$i = p + 1, \ldots, p + m$). Now estimate a regression of $\hat{e}_t$ on the $(p + m + q)$ variables 
$\hat{v}_{it}$ and $\hat{w}_{jt}$. Let $R^2$ be the coefficient of determination and $N$ the number of 
observations. The LM statistic is given by 

$$\text{LM} = NR^2$$

which has (asymptotically) a $\chi_m^2$-distribution. 

Godfrey studies the finite sample properties of the LM test in the context of 
overfitting an AR(1) process by higher-order autoregressive models. He finds 
that the power of the LM tests is higher than that of the $Q$ and $Q^*$ tests unless 
the number of overfitted parameters is large. In the special case where $m$ is 
large, the LM test and the tests like $Q$ and $Q^*$ coincide. This explains why the 
$Q$-statistic or $Q^*$-statistic is not very useful, because when we are testing the 
adequacy of the ARMA($p$, $q$) model, the alternative we have in mind is an 
ARMA($p + m$, $q$) model with large $m$. 

⁹ G. M. Ljung and G. E. P. Box, “On a Measure of Lack of Fit in Time Series Models,” 
Biometrika, Vol. 65, 1978, pp. 297-303. 

¹⁰ N. Davies and P. Newbold, “Some Power Studies of a Portmanteau Test of Time Series Model 
Specification,” Biometrika, Vol. 66, 1979, pp. 153-155. 

¹¹ See J. R. M. Hosking, “Lagrange Multiplier Tests of Time Series Models,” Journal of the 
Royal Statistical Society, Series B, Vol. 42, 1980, pp. 170-181, for a general discussion. 

¹² L. G. Godfrey, “Testing the Adequacy of a Time Series Model,” Biometrika, Vol. 66, 1979, 
pp. 67-72. 

Godfrey's procedure of using an auxiliary regression of $\hat{e}_t$ on $\hat{v}_{it}$ and $\hat{w}_{jt}$ is also 
applicable when the alternative is ARMA($p + k_1$, $q + k_2$) with 
$m = \max(k_1, k_2)$, as shown in Poskitt and Tremayne.¹³ Despite its limitations discussed 
here, there are numerous studies that quote the Box-Pierce $Q$-statistic and 
argue that if it is not significant (at the 5 percent level), this is confirmatory 
evidence that the fitted model is adequate. The available evidence shows that the 
LM tests are more appropriate, and they can be implemented by some 
regression routines. However, until they are built into computer programs, they are 
not likely to be used. Currently, most time-series programs have the $Q$-statistic 
built into them despite all the limitations of the $Q$-statistics discussed here. 

13.6 The Box-Jenkins Approach 


The Box-Jenkins approach14 is one of the most widely used methodologies for
the analysis of time-series data. It is popular because of its generality; it
can handle any series, stationary or not, with or without seasonal elements,
and it has well-documented computer programs. It is perhaps the last factor
that contributes most to its popularity. Although Box and Jenkins have been
neither the originators nor the most important contributors in the field of
ARMA models,15 they have popularized these models and made them readily
accessible to everyone, so much so that ARMA models are sometimes referred to
as Box-Jenkins models.

The basic steps in the Box-Jenkins methodology are (1) differencing the series
so as to achieve stationarity, (2) identification of a tentative model, (3)
estimation of the model, (4) diagnostic checking (if the model is found
inadequate, we go back to step 2), and (5) using the model for forecasting and
control. Schematically, we can describe the steps as in Figure 13.3. We will
now discuss these steps in turn.

Figure 13.3. The Box-Jenkins methodology for ARIMA models.

13 D. S. Poskitt and A. R. Tremayne, “Testing the Specification of a Fitted
ARMA Model,” Biometrika, Vol. 67, 1980, pp. 359-363.

14 See Box and Jenkins, Time Series Analysis. The first edition of the book
appeared in 1970.

15 These models were discussed earlier in M. H. Quenouille, The Analysis of
Multiple Time Series (London: Charles Griffin, 1957).

1. Differencing to achieve stationarity: How do we conclude whether a time
   series is stationary or not? We can do this by studying the graph of the
   correlogram of the series. The correlogram of a stationary series drops
   off as k, the number of lags, becomes large, but this is not usually the
   case for a nonstationary series. Thus the common procedure is to plot
   the correlogram of the given series $y_t$ and successive differences
   $\Delta y_t$, $\Delta^2 y_t$, and so on, and look at the correlograms at
   each stage. We keep differencing until the correlogram dampens.

2. Once we have used the differencing procedure to get a stationary time
   series, we examine the correlogram to decide on the appropriate orders
   of the AR and MA components. The correlogram of an MA process is zero
   after a point. That of an AR process declines geometrically. The
   correlograms of ARMA processes show different patterns (but all dampen
   after a while). Based on these, one arrives at a tentative ARMA model.
   This step involves more of a judgmental procedure than the use of any
   clear-cut rules.

3. The next step is the estimation of the tentative ARMA model identified
   in step 2. We have discussed in the preceding section the estimation of
   ARMA models.

4. The next step is diagnostic checking to check the adequacy of the
   tentative model. We discussed in the preceding section the Q and Q*
   statistics commonly used in diagnostic checking. As argued there, the Q
   statistics are inappropriate in autoregressive models and thus we need to
   replace them with some LM test statistics.

5. The final step is forecasting. We shall now discuss this problem.
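Steps 1 and 2 rest on computing the correlogram of the series and of its successive differences. A minimal sketch of that computation follows; the function names and the simulated random walk with drift are illustrative assumptions, and deciding whether a correlogram "dampens" remains a matter of judgment, as the text notes.

```python
import random

def autocorrelations(series, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(1, max_lag + 1)
    ]

def difference(series, lag=1):
    """Return y_t - y_{t-lag}; the result is `lag` observations shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical nonstationary series: a random walk with drift.
random.seed(1)
y = [0.0]
for _ in range(299):
    y.append(y[-1] + 0.5 + random.gauss(0.0, 1.0))

r_levels = autocorrelations(y, 10)             # stays high at every lag
r_diff = autocorrelations(difference(y), 10)   # close to zero at every lag

for k, (a, b) in enumerate(zip(r_levels, r_diff), start=1):
    print(f"lag {k:2d}  levels {a:6.3f}  first differences {b:6.3f}")
```

Plotting the two columns against the lag gives the correlograms of the level series and of the first differences.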

Forecasting from Box-Jenkins Models

To fix ideas we shall illustrate forecasting from the ARMA(2, 2) model we have
been considering. Suppose that we have estimated the model with n
observations. We want to forecast $x_{n+k}$. This is called a k-period-ahead
forecast. It is denoted by $\hat{x}_{n,k}$. The first subscript gives the time
period when the forecast is made, and the second subscript denotes the number
of time periods ahead for which the forecast is made. Let us start with k = 1
so that we need a forecast of $x_{n+1}$ at time period n. We have

$$x_{n+1} = \alpha_1 x_n + \alpha_2 x_{n-1} + \varepsilon_{n+1} + \beta_1 \varepsilon_n + \beta_2 \varepsilon_{n-1}$$

We observe $x_n$ and $x_{n-1}$. We can replace $\varepsilon_n$ and
$\varepsilon_{n-1}$ by the predicted residuals $\hat{\varepsilon}_n$ and
$\hat{\varepsilon}_{n-1}$ (obtaining the residuals was described in the
preceding section). The only unknown is $\varepsilon_{n+1}$. This we replace
by its expected value zero. Hence

$$\hat{x}_{n,1} = \hat{\alpha}_1 x_n + \hat{\alpha}_2 x_{n-1} + \hat{\beta}_1 \hat{\varepsilon}_n + \hat{\beta}_2 \hat{\varepsilon}_{n-1}$$

Now let us go to k = 2. We have

$$x_{n+2} = \alpha_1 x_{n+1} + \alpha_2 x_n + \varepsilon_{n+2} + \beta_1 \varepsilon_{n+1} + \beta_2 \varepsilon_n$$

We replace $\varepsilon_{n+2}$ and $\varepsilon_{n+1}$ by zero, their expected
value. $x_{n+1}$ is not known, but we have the forecast $\hat{x}_{n,1}$. Thus
we get

$$\hat{x}_{n,2} = \hat{\alpha}_1 \hat{x}_{n,1} + \hat{\alpha}_2 x_n + \hat{\beta}_2 \hat{\varepsilon}_n$$

We continue like this. The procedure is:

1. Write out the expression for $x_{n+k}$.

2. Replace all future values $x_{n+j}$ (j > 0, j < k) by their forecasts.

3. Replace all $\varepsilon_{n+j}$ (j > 0) by zero.

4. Replace all $\varepsilon_{n+j}$ (j ≤ 0) by the predicted residuals.

An alternative procedure is to write $x_t$ in terms of all the lagged x’s. For
this we use the procedure outlined in the preceding section. We have

$$(1 + \gamma_1 L + \gamma_2 L^2 + \gamma_3 L^3 + \cdots) x_t = \varepsilon_t$$

The γ’s are obtained as discussed in the preceding section. Now we have

$$x_{t+1} = -[\gamma_1 x_t + \gamma_2 x_{t-1} + \gamma_3 x_{t-2} + \cdots] + \varepsilon_{t+1}$$

To get $\hat{x}_{t,1}$, we replace $\varepsilon_{t+1}$ by zero, its expected
value, and the γ’s by their estimates. The procedure is the same as earlier
except that we do not have to deal with the predicted residuals.
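The four-step recursion for the ARMA(2, 2) case can be sketched as below. The function name, argument layout, and the numerical values in the usage lines are illustrative assumptions; the parameter estimates and predicted residuals are taken as given from the estimation stage.

```python
def arma22_forecasts(x, resid, alpha, beta, horizon):
    """k-period-ahead forecasts, k = 1..horizon, from an ARMA(2, 2) model.

    x      : observed series, ending with x_{n-1}, x_n
    resid  : predicted residuals, ending with e_{n-1}, e_n
    alpha  : (alpha_1, alpha_2) AR estimates
    beta   : (beta_1, beta_2) MA estimates
    """
    a1, a2 = alpha
    b1, b2 = beta
    xs = list(x[-2:])      # forecasts are appended here and reused as "data"
    es = list(resid[-2:])  # future shocks are appended as zeros
    out = []
    for _ in range(horizon):
        es.append(0.0)  # step 3: future epsilon replaced by its expected value
        f = a1 * xs[-1] + a2 * xs[-2] + es[-1] + b1 * es[-2] + b2 * es[-3]
        xs.append(f)    # step 2: later horizons use this forecast
        out.append(f)
    return out

# Hypothetical inputs: x_{n-1} = 1.0, x_n = 2.0, e_{n-1} = 0.5, e_n = -0.5.
forecasts = arma22_forecasts(
    x=[1.0, 2.0], resid=[0.5, -0.5], alpha=(0.5, 0.2), beta=(0.4, 0.1), horizon=3
)
print(forecasts)
```

From the third step onward no residual enters, so the forecasts follow the pure AR(2) recursion in the earlier forecasts.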

Illustrative Example 

As an illustrative example we consider the problem of forecasting hog
marketings considered by Leuthold et al.16 It is an old study, but the example
illustrates how the correlogram can be used to arrive at a model that uses
higher than first-order differences. The data consist of 275 daily
observations. The correlogram for the original data is presented in Figure
13.4. The correlogram does not damp, thus indicating nonstationarity. The
peaks at 5, 10, 15, ... indicate a strong 5-day weekly effect. Figure 13.5
shows the correlogram of first differences. It still shows peaks at 5, 10, 15,
and so on, and it does not show any sign of damping. Since the peaks do not
damp, it suggests a fifth-order MA component as well. We next try fifth
differences, that is, $X_t - X_{t-5}$. The correlogram, shown in Figure 13.6,
damps. But the initial decline and oscillation suggest the use of an ARMA
model rather than a pure AR or MA model. Leuthold et al. finally arrive at the
model

$$(1 - \phi L^5)(1 - \alpha_1 L - \alpha_2 L^2) x_t = (1 - \theta L^5)(1 + \beta_1 L + \beta_2 L^2) \varepsilon_t$$

Figure 13.6. Autocorrelation function of fifth differences.

16 R. M. Leuthold, A. M. A. MacCormick, A. Schmitz, and D. G. Watts,
“Forecasting Daily Hog Prices and Quantities: A Study of Alternative
Forecasting Techniques,” Journal of the American Statistical Association,
March 1970, pp. 90-107.

There are three parameters in the MA part. Hence they use the grid-search
procedure on the three parameters $\theta$, $\beta_1$, and $\beta_2$. The
estimated parameters were

$$\phi = 0.90 \qquad \alpha_1 = 1.44 \qquad \alpha_2 = -0.47$$

$$\theta = 0.70 \qquad \beta_1 = -1.52 \qquad \beta_2 = 0.66$$

The model therefore is

$$(1 - 0.90L^5)(1 - 1.44L + 0.47L^2) x_t = (1 - 0.70L^5)(1 - 1.52L + 0.66L^2) \varepsilon_t$$

This gives

$$x_t = 1.44x_{t-1} - 0.47x_{t-2} + 0.90x_{t-5} - (0.90 \times 1.44)x_{t-6} + (0.90 \times 0.47)x_{t-7}$$
$$\qquad + \varepsilon_t - 1.52\varepsilon_{t-1} + 0.66\varepsilon_{t-2} - 0.70\varepsilon_{t-5} + (0.70 \times 1.52)\varepsilon_{t-6} - (0.70 \times 0.66)\varepsilon_{t-7}$$

This is the equation we use for forecasting purposes, using the methods
outlined earlier. Note that $\beta_1$ and $\beta_2$ satisfy the conditions in
(13.1) and $\alpha_1$ and $\alpha_2$ satisfy the conditions in (13.4). There
were no diagnostic tests and no comparison with alternative models. But the
example is presented here as an illustration of how first differences were not
appropriate but fifth differences were.
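The expansion of the multiplicative model into a single forecasting equation is just a multiplication of lag polynomials. The sketch below checks the coefficients quoted above; the helper name and the list representation (coefficients of $L^0, L^1, \ldots$) are illustrative choices, not part of the original study.

```python
def poly_mul(a, b):
    """Multiply two lag polynomials given as coefficient lists [c0, c1, ...]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# AR side: (1 - 0.90 L^5)(1 - 1.44 L + 0.47 L^2)
ar = poly_mul([1, 0, 0, 0, 0, -0.90], [1, -1.44, 0.47])
# MA side: (1 - 0.70 L^5)(1 - 1.52 L + 0.66 L^2)
ma = poly_mul([1, 0, 0, 0, 0, -0.70], [1, -1.52, 0.66])

# Moving the AR terms to the right-hand side flips their signs, so the
# coefficient on x_{t-k} in the forecasting equation is -ar[k] for k >= 1.
x_coeffs = {k: -c for k, c in enumerate(ar) if k >= 1 and c != 0}
e_coeffs = {k: c for k, c in enumerate(ma) if c != 0}
print("x lags:", x_coeffs)
print("e lags:", e_coeffs)
```

The lag-6 and lag-7 terms are the cross products of the seasonal and nonseasonal factors, which is why they appear in the text as 0.90 × 1.44, 0.90 × 0.47, 0.70 × 1.52, and 0.70 × 0.66.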

Trend Elimination: The Traditional Method 

The Box-Jenkins method eliminates trend by differencing the series. A more
traditional method adopts the ratio to moving-average method. An example of
this is the Beveridge price index.17 This is a historically interesting time
series of wheat prices in western and central Europe from 1500 to 1869 (370
years). The index is made up from prices in nearly 50 places in various
countries (with the mean for 1700-1745 = 100). Beveridge gets a trend-free
index by expressing the index for a given year as a percentage of the mean of
the 31 years for which it is the center. For instance, for the year 1600, we
divide the index for 1600 by the average of the index over the years 1585-1615
and convert this to a percentage. This is the ratio to moving-average method.
Note that you lose 15 observations at the beginning and 15 observations at the
end. The last ones are often critical.

Table 13.2 at the end of the chapter gives the Beveridge index and Figure 13.7
gives the correlogram of the trend-free index, which shows that the series is
stationary. Calculation of the correlogram of the raw index and the first
difference of the series is left as an exercise.

Figure 13.7. Correlogram for the Beveridge trend-free price index.
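The ratio to moving-average calculation described above can be sketched as follows. The function name and the short artificial series are illustrative assumptions; the 31-year centered window is the one used by Beveridge.

```python
def ratio_to_moving_average(index, window=31):
    """Trend-free index: each value as a percentage of its centered moving average.

    Returns a dict {position: percentage}; the first and last window // 2
    positions are lost because the centered window does not fit there.
    """
    half = window // 2
    trend_free = {}
    for t in range(half, len(index) - half):
        centered_mean = sum(index[t - half:t + half + 1]) / window
        trend_free[t] = 100.0 * index[t] / centered_mean
    return trend_free

# Artificial example: a pure linear trend of 40 observations.
raw = [100.0 + t for t in range(40)]
tfi = ratio_to_moving_average(raw)
print(len(raw) - len(tfi), "observations lost")
print(sorted(tfi.items())[:3])
```

For a pure linear trend the centered mean equals the middle observation, so every trend-free value is 100; departures from 100 in real data are what the method is designed to isolate.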

A Summary Assessment 

The Box-Jenkins models were very popular in the early 1970s because of their
better forecasting performance compared to econometric models. The main
drawback of econometric models at that time was that inadequate attention was
paid to the time-series structure of the underlying data and the specification
of the dynamic structure. Once this was done, the forecasting performance of
econometric models improved considerably, and the econometric models started
to do better than univariate time-series models.18 Later it was argued that
the appropriate comparison should be between econometric models and
multivariate time-series models (not univariate time-series models). However,
in the early 1970s what was being argued was that even univariate ARIMA models
did better in forecasting than econometric models. At least this claim can be
put to rest by taking care of the dynamics in the specification of econometric
models. As for multivariate time-series models versus econometric models,
there is still some controversy.

17 W. H. Beveridge, “Weather and Harvest Cycles,” Economic Journal, Vol. 31,
1921, pp. 429-452.

Quite apart from the forecasting aspect, the Box-Jenkins methodology involves
a lot of judgmental decisions and a considerable amount of “data mining.” This
data mining can be avoided if we can confine our attention to autoregressive
processes only. Such processes are easier to estimate and easier to subject to
specification testing. We start with a fairly high-order AR process and then
simplify it to a parsimonious model by successive “testing down.” This
procedure has been suggested by Anderson.19 For ARMA models, on the other
hand, moving from the general to the specific is difficult. This is why Box
and Jenkins suggest starting with a parsimonious model and building up
progressively if the estimated residuals show any systematic pattern (see the
earlier discussion on diagnostic checking).

The reason why one may not be able to avoid the MA part in the Box-Jenkins
methodology is that the MA part can be produced by the differencing operation
undertaken to remove the trend and produce a stationary series. For instance,
if the true model is

$$y_t = \alpha + \beta t + \varepsilon_t$$

taking second differences we eliminate the trend but the residual series
$\varepsilon_t - 2\varepsilon_{t-1} + \varepsilon_{t-2}$ is an MA(2) process
and it is not even invertible.
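The algebra behind this statement can be written out step by step. The polynomial in $z$ below is simply the MA lag polynomial with $L$ replaced by a complex variable, a device introduced here only for the illustration.

```latex
\begin{aligned}
\Delta y_t   &= y_t - y_{t-1} = \beta + \varepsilon_t - \varepsilon_{t-1}, \\
\Delta^2 y_t &= \Delta y_t - \Delta y_{t-1}
              = \varepsilon_t - 2\varepsilon_{t-1} + \varepsilon_{t-2}
              = (1 - L)^2 \varepsilon_t, \\
1 - 2z + z^2 &= (1 - z)^2 = 0 \quad\Longrightarrow\quad z = 1 \ \text{(double root)}.
\end{aligned}
```

Both roots lie on the unit circle rather than outside it, so the MA(2) error cannot be written as a convergent infinite autoregression. The first line also shows that a single difference already removes the linear trend, and the error $\varepsilon_t - \varepsilon_{t-1}$ it leaves behind is itself a noninvertible MA(1).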

Seasonality in Box-Jenkins Modeling

Suppose that we have monthly data. Then observations in any month are often 
affected by some seasonal tendencies peculiar to that month. A classic example 
is the upward jump in sales in December because of Christmas shopping. The 
observation in December 1991 is more likely to be highly correlated with ob¬ 
servations in December of past years than with observations in other months 
closer to December 1991. 

18 An illustration of this comparison is D. L. Prothero and K. F. Wallis,
“Modelling Macroeconomic Time-Series” (with discussion), Journal of the Royal
Statistical Society, Series A, Vol. 139, 1976, pp. 468-500.

19 See the discussion of Anderson’s test in Section 10.8.

Just as the Box-Jenkins method accounts for trends by using first differences,
it accounts for seasonality by using the twelfth difference, that is, (January
1991 - January 1990), (February 1991 - February 1990), (March 1991 - March
1990), and so on. If we denote the first difference operator by
$\Delta = 1 - L$, the seasonal difference will be denoted by
$\Delta_{12} = 1 - L^{12}$. With quarterly data we use the seasonal difference
operator $\Delta_4 = 1 - L^4$. Note that $\Delta_{12} \neq \Delta^{12}$
because $1 - L^{12} \neq (1 - L)^{12}$. If the original series is $y_t$, we
get $\Delta y_t = y_t - y_{t-1}$, $\Delta_4 y_t = y_t - y_{t-4}$, and
$\Delta_{12} y_t = y_t - y_{t-12}$. If we have monthly data and we want to
eliminate both trend and the monthly seasonal, we use both the operators
$\Delta$ and $\Delta_{12}$, that is, we use both first differences and twelfth
differences. Instead of using the first difference, that is, $(1 - L)$, we can
sometimes consider a quasi-first difference, that is, $(1 - \alpha L)$ [e.g.,
$(1 - 0.8L)$ or $(1 - 0.75L)$]. Similarly, for the seasonal, instead of using
$(1 - L^{12})$, we might use $(1 - 0.8L^{12})$, and so on.
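The difference operators above translate directly into code. In this sketch the function names and the artificial monthly series (a unit-slope trend plus a fixed seasonal pattern) are illustrative assumptions.

```python
def difference(series, lag=1):
    """(1 - L^lag) applied to the series: y_t - y_{t-lag}."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def quasi_difference(series, a, lag=1):
    """(1 - a L^lag) applied to the series: y_t - a * y_{t-lag}."""
    return [series[t] - a * series[t - lag] for t in range(lag, len(series))]

# Artificial monthly data: trend t plus a seasonal pattern repeating every 12 months.
season = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0, 11, 10]
y = [t + season[t % 12] for t in range(48)]

d12 = difference(y, lag=12)       # seasonal difference: removes the seasonal pattern
d1_d12 = difference(d12, lag=1)   # first difference of that: removes the trend as well

d_pow12 = y
for _ in range(12):               # the first difference applied twelve times
    d_pow12 = difference(d_pow12, lag=1)

print(d12[:5], d1_d12[:5])
print(d_pow12[:5])                # not the same thing as the seasonal difference
```

The seasonal difference of this series is constant (the trend rises by 12 per year), and one further first difference reduces it to zero, whereas applying the first difference twelve times gives a different series altogether.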

Chatfield20 gives an example of some telephone data analyzed by Tomasek21
using the Box-Jenkins method. Tomasek developed the model

$$(1 - 0.84L)(1 - L^{12})(y_t - 132) = (1 - 0.60L)(1 + 0.37L^{12}) \varepsilon_t$$

which, when fitted to all the data, explained 99.4% of the total variation
about the mean [i.e., $\sum (y_t - \bar{y})^2$]. On the basis of this good
fit, Tomasek recommended the use of the Box-Jenkins method for forecasting.

Chatfield argues that if one looks at the data, one finds an unusually regular
seasonal pattern which itself explains 97% of the total variation. For this
series, once the seasonal element is accounted for, nearly any forecasting
method gives good results. For instance, an adaptive forecasting equation
would be

$$\hat{y}_{t,1} = \lambda y_t + (1 - \lambda) \hat{y}_{t-1,1} \qquad 0 < \lambda < 1$$

(i.e., the forecasts are revised on the basis of the most recent forecast
errors).
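The adaptive forecasting equation can be sketched in its error-correction form, which makes the revision by the latest forecast error explicit. The function name, the need for a starting forecast, and the numbers in the usage lines are assumptions made for the illustration.

```python
def adaptive_forecasts(y, lam, initial):
    """One-step-ahead adaptive forecasts.

    forecasts[t] is the forecast of y[t] made one period earlier; the final
    element is the forecast of the next, not yet observed, value.
    `initial` is an assumed starting forecast for y[0].
    """
    forecasts = [initial]
    for obs in y:
        prev = forecasts[-1]
        # Equivalent to lam * obs + (1 - lam) * prev.
        forecasts.append(prev + lam * (obs - prev))
    return forecasts

f = adaptive_forecasts([10.0, 12.0, 11.0], lam=0.5, initial=10.0)
print(f)
```

A value of λ near 1 makes the forecast track the latest observation closely, while a value near 0 makes it change slowly.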

The differencing operation in the Box-Jenkins procedure is also considered one
of its main limitations in the treatment of series that exhibit a moving
seasonal and a moving trend. The Box-Jenkins procedure accounts for the trend
and seasonal elements through the use of differencing operators and then pays
elaborate attention to the ARMA modeling of what is left. This inordinate
attention to ARMA modeling obscures some other aspects of the time series that
need more attention. It has been found that other procedures that permit
adaptive revision of the seasonal elements, trend term, and the residual may
give better forecasts than the Box-Jenkins method.

20 C. Chatfield, The Analysis of Time Series: Theory and Practice (London:
Chapman & Hall, 1975), p. 103.

21 The reference to Tomasek and the data used can be found in Chatfield’s
book, p. 103.

13.7 $R^2$ Measures in Time-Series Models


First let’s consider pure autoregressive models: AR(p). How do we choose the
order p? One might think of using $R^2$ and other criteria discussed in
Section 12.7, but most of these criteria were discussed under the assumption
of nonstochastic regressors. One procedure that is popular for choosing the
order p is Akaike’s FPE criterion.22 This says: Choose the order p by
minimizing

$$\text{FPE} = \hat{\sigma}_p^2 \left(1 + \frac{p + 1}{n}\right)$$

where $\hat{\sigma}_p^2$ is the estimate of $\sigma^2 = \text{var}(\varepsilon_t)$
when the order of the autoregressive process is p, that is,
$\hat{\sigma}_p^2 = \text{RSS}/(n - p - 1)$.
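A minimal sketch of order selection by FPE follows, using the standard form of Akaike's criterion with the degrees-of-freedom-corrected variance estimate defined above. The function name and the residual sums of squares in the usage lines are hypothetical and do not come from any data set in the chapter.

```python
def fpe(rss, n, p):
    """Akaike's final prediction error for an AR(p) fitted to n observations."""
    sigma2_p = rss / (n - p - 1)          # variance estimate for order p
    return sigma2_p * (1.0 + (p + 1) / n)

# Hypothetical residual sums of squares from AR(1), AR(2), AR(3) fits.
n = 61
rss_by_order = {1: 0.2400, 2: 0.2390, 3: 0.2385}

fpe_by_order = {p: fpe(rss, n, p) for p, rss in rss_by_order.items()}
best_order = min(fpe_by_order, key=fpe_by_order.get)

for p, value in fpe_by_order.items():
    print(f"p = {p}: FPE = {value:.5f}")
print("chosen order:", best_order)
```

RSS can only fall as p increases, so the criterion chooses a higher order only when the reduction in RSS outweighs the penalty from the two factors involving p.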

Time-series observations normally show a strong trend (upward or downward) and
strong seasonal effects. Any model that is able to pick up these effects will
have a high $R^2$. The question is: How good and reliable is this? For
instance, Box and Jenkins (1976, Chap. 9) have a data set consisting of 144
monthly observations on the variable $y_t$, defined as the logarithm of the
number of airline passengers carried per month. Regressing $y_t$ on a time
trend and seasonal dummies gives $R^2 = 0.983$. Is this a good model?

Harvey23 suggests some measures of relative $R^2$'s to judge the usefulness of
a model. Note that the criterion on which the usual $R^2$ is based is the
residual sum of squares from the model relative to the residual sum of squares
from a naive alternative (that consists of the estimation of the mean only).
For instance, if $S_{yy} = \sum (y_t - \bar{y})^2$ and RSS is the residual sum
of squares from the model,

$$R^2 = 1 - \frac{\text{RSS}}{S_{yy}} \qquad \text{(see Chapter 4)}$$

Thus the “norm” $R^2$ judges a model compared with a naive model, where only
the mean is estimated. In time-series models with strong trends and seasonals,
this is not a meaningful alternative. The meaningful alternative is a random
walk with drift, or with seasonal data, a random walk with drift with seasonal
dummies added. Harvey suggests two alternative $R^2$ measures. One is

$$R_D^2 = 1 - \frac{\text{RSS}}{\sum_{t=2}^{T} (\Delta y_t - \overline{\Delta y})^2}$$

The numerator RSS is the residual sum of squares from the model we estimate.
The denominator $\sum_{t=2}^{T} (\Delta y_t - \overline{\Delta y})^2$ is the
residual sum of squares from a random walk with drift, that is,
$y_t = y_{t-1} + \beta + \varepsilon_t$, $t = 2, \ldots, T$. For most
time-series data this is the “naive alternative.” A model for which $R_D^2$ is
negative should be discarded. To adjust for degrees of freedom, we divide the
numerator and denominator in $R_D^2$ by the appropriate degrees of freedom.
The denominator d.f. are $(T - 2)$. The numerator d.f. are $(T - k)$, where k
is the number of parameters estimated. The $R_D^2$ adjusted for d.f. can be
called $\bar{R}_D^2$.
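The measure $R_D^2$ and its degrees-of-freedom-adjusted version can be computed as in the sketch below. The function name and the five-observation series are illustrative assumptions; the RSS of the model being judged is taken as given.

```python
def r2_d(y, rss, k=None):
    """Harvey's R^2 relative to a random walk with drift.

    y   : the level series y_1, ..., y_T
    rss : residual sum of squares from the model being judged
    k   : number of estimated parameters; if given, the d.f.-adjusted
          version is returned (numerator d.f. T - k, denominator d.f. T - 2)
    """
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    mean_dy = sum(dy) / len(dy)
    rss_naive = sum((d - mean_dy) ** 2 for d in dy)  # random walk with drift
    if k is None:
        return 1.0 - rss / rss_naive
    T = len(y)
    return 1.0 - (rss / (T - k)) / (rss_naive / (T - 2))

y = [1.0, 2.0, 4.0, 5.0, 7.0]   # first differences 1, 2, 1, 2
print(r2_d(y, rss=0.5))         # model beats the naive alternative
print(r2_d(y, rss=2.0))         # negative: model is worse than the random walk
print(r2_d(y, rss=0.5, k=3))    # d.f.-adjusted version
```

A negative value signals that the fitted model has a larger residual sum of squares than the random walk with drift, whatever its conventional $R^2$ may be.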

With seasonal data, Harvey suggests

$$R_S^2 = 1 - \frac{\text{RSS}}{\text{RSS}_0}$$

where RSS is, as before, the residual sum of squares from the model and
$\text{RSS}_0$ is the residual sum of squares from the naive model, which in
this case is the random walk with drift with seasonal dummies, that is,

$$\Delta y_t = \alpha_1 S_1 + \cdots + \alpha_k S_k + \varepsilon_t$$

where $S_1, S_2, \ldots, S_k$ are the k seasonal dummies. For quarterly data
k = 4, and for monthly data k = 12. For most time-series data this naive model
fits well and any alternative model has to do better than this. A model for
which $R_S^2 < 0$ should be discarded as useless. Again, we can adjust the
numerator and denominator in $R_S^2$ for degrees of freedom and define the
resulting measure as $\bar{R}_S^2$.

22 H. Akaike, “Fitting Autoregressions for Prediction,” Annals of the
Institute of Statistical Mathematics, Vol. 21, No. 2, 1969, pp. 243-247.

23 A. C. Harvey, “A Unified View of Statistical Forecasting Procedure,”
Journal of Forecasting, Vol. 3, 1984, pp. 245-275.

As an example, with the airline data mentioned earlier, $R^2 = 0.983$ and
RSS = 0.47232. But $\text{RSS}_0 = 0.19405$ (from the random walk with
seasonal dummies). Hence $R_S^2 = 1 - 0.47232/0.19405 = -1.434$. This
indicates that the model with $R^2 = 0.983$ is indeed weak even though it has
a high $R^2$.

As yet another example, consider the Nelson and Plosser data on per capita
real GNP (PCRGNP) in Table 13.3 at the end of the chapter. In this case,
$\sum (\Delta y_t - \overline{\Delta y})^2$ is 1.057 and the $R_D^2$ for the
model $y_t = \alpha + \beta t + \varepsilon_t$ is -3.13. This suggests that
the model $y_t = \alpha + \beta t + \varepsilon_t$ with $R^2 = 0.86$ is
inadequate.

Let us illustrate the procedure of fitting autoregressions of different orders
for these data. Figures 13.8 and 13.9 show the correlograms for the logarithm
of PCRGNP in levels as well as first differences. The correlograms suggest
that the series is nonstationary in the levels form but stationary in the
first differences. Also, the fluctuation in the correlogram suggests that AR
models are appropriate. We therefore fitted autoregressive models AR(1),
AR(2), AR(3), and so on, for $\Delta y_t$. The appropriate order of the
autoregression can be chosen on the basis of the partial autocorrelations. The
partial autocorrelations of order 1, 2, 3, and so on, are defined as the last
coefficients in the AR(1), AR(2), AR(3), and so on, regressions. It has been
proved that each of the partial autocorrelations has an approximate SE of
$1/\sqrt{N}$, so values larger than $2/\sqrt{N}$ in absolute value are
significant. We can also compute the $R_D^2$ for these models as well as for
the model $y_t = \alpha + \beta t + \varepsilon_t$. Table 13.1 shows the
partial autocorrelations, $R_D^2$, $\bar{R}_D^2$, and Akaike’s FPE for
$\Delta y_t$. The partial autocorrelations, $\bar{R}_D^2$, and FPEs all
suggest that the appropriate order of autoregression is 1.

Figure 13.8. Correlogram for the Nelson-Plosser data on real per capita GNP.

Figure 13.9. Correlogram for first differences of real per capita GNP.


Table 13.1

       Partial
p      Autocorrelation    $R_D^2$    $\bar{R}_D^2$    Akaike’s FPE
1          0.3315          0.1097        ...             0.0041
2          ...             0.1155        0.0523          0.0042
3         -0.1799          0.1466        0.0518          0.0043
4         -0.1101          0.1624        ...             0.0045
5         -0.0917          0.1911        ...             0.0046
6          ...             0.1996        ...             0.0048
7          ...             0.2112        ...             ...
8          ...             0.2472        ...             ...
9         -0.1891          0.3045        ...             0.0049
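The partial autocorrelations are defined above as the last coefficients of successive AR regressions. The sketch below uses a different but closely related route, the Durbin-Levinson recursion on the autocorrelations, which gives the Yule-Walker counterparts of those coefficients rather than the OLS values themselves. The function name and the example autocorrelations are assumptions for the illustration.

```python
import math

def partial_autocorrelations(rho, max_order):
    """Partial autocorrelations of order 1..max_order from autocorrelations.

    rho[k] is the autocorrelation at lag k, with rho[0] == 1.
    Uses the Durbin-Levinson recursion (Yule-Walker form of the AR fits).
    """
    pacf = []
    phi_prev = []  # coefficients of the AR(k-1) fit
    for k in range(1, max_order + 1):
        if k == 1:
            phi_kk = rho[1]
            phi = [phi_kk]
        else:
            num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
            phi.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi
    return pacf

# Theoretical autocorrelations of an AR(1) with coefficient 0.6: rho_k = 0.6 ** k.
rho_ar1 = [0.6 ** k for k in range(5)]
print(partial_autocorrelations(rho_ar1, 3))  # cuts off after order 1

# Rough significance band for N observations (two standard errors).
N = 61
print("band: +/-", round(2 / math.sqrt(N), 3))
```

For an AR(p) process the theoretical partial autocorrelations are zero beyond order p, which is what makes them useful for choosing the order.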


Summary 


1. Time series can be classified as stationary and nonstationary. When we 
refer to stationary time series, we mean covariance stationarity. 

2. Models commonly used for stationary series are the autoregressive
moving-average (ARMA) models. Conditions are given for the stability of the AR
models and the invertibility of the MA models. If the MA model satisfies the
invertibility condition, we can write it as an infinite autoregression.

3. AR models can be estimated by OLS, but estimation of MA models is 
somewhat complex. In practice, however, the MA part does not involve too 
many parameters. A grid-search procedure is described for the estimation of 
MA models. 

4. After the estimation of the ARMA model, one has to apply tests for serial 
correlation to check that the errors are white noise. Tests for this are described. 
Also, different ARMA models can be compared using the AIC and BIC criteria. 
For comparing different AR models, Akaike’s FPE criterion can also be 
used. 

5. Another goodness-of-fit measure is $R^2$. However, time-series data usually
have strong trends and seasonals and the $R^2$'s we get are often very high.
It is difficult to judge the usefulness of a model by just looking at the high
$R^2$. Some alternative measures have been discussed, and these should be
used.


Exercises 


1. Figure 13.1 shows the graph of the stationary series
   $X_t = 0.7X_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim \text{IN}(5, 1)$.
   To generate the series we need the starting value $X_0$. We took it as
   $\mu = E(X_t)$. We find it from the equation
   $E(X_t) = 0.7E(X_{t-1}) + E(\varepsilon_t)$ or $\mu = 0.7\mu + 5$. This
   gives $\mu = 5/0.3 = 16.67$. Using the same $X_0$, graph the nonstationary
   series $X_t = 10 + t + 0.7X_{t-1} + \varepsilon_t$ for
   $t = 1, 2, \ldots, 100$. This is called a trend-stationary series.

2. Starting with the same initial value $X_0$, graph the difference-stationary
   series $X_t - X_{t-1} = 10 + \varepsilon_t$.

3. Which of the following AR(2) processes are stable?

   (a) $X_t = 0.9X_{t-1} - 0.2X_{t-2} + \varepsilon_t$.

   (b) $X_t = 0.8X_{t-1} + 0.4X_{t-2} + \varepsilon_t$.

   (c) $X_t = 1.0X_{t-1} - 0.8X_{t-2} + \varepsilon_t$.





4. Which of the following MA(2) processes are invertible?

   (a) $X_t = \varepsilon_t - 0.9\varepsilon_{t-1} + 0.2\varepsilon_{t-2}$.

   (b) $X_t = \varepsilon_t - 1.8\varepsilon_{t-1} + 0.4\varepsilon_{t-2}$.

   (c) $X_t = \varepsilon_t - 0.8\varepsilon_{t-1} + 0.4\varepsilon_{t-2}$.

5. Compute the acf for the following AR(2) processes and plot their
   correlograms.

   (a) $X_t = 1.0X_{t-1} - 0.5X_{t-2} + \varepsilon_t$. (The acf is given in
       the text. Just plot the correlogram.)

   (b) $X_t = 0.9X_{t-1} - 0.2X_{t-2} + \varepsilon_t$.

6. Table 13.2 at the end of the chapter gives the Beveridge index, and Figure
   13.7 gives the correlogram for the trend-free index (with trend eliminated
   by the ratio to moving-average method). Plot the correlograms for the raw
   index and the first differences of the series.

7. Consider the ARMA model

   $$X_t = 1.0X_{t-1} - 0.5X_{t-2} + \varepsilon_t - 0.9\varepsilon_{t-1} + 0.2\varepsilon_{t-2}$$

   Express $\varepsilon_t$ as a function of $X_t$ and lagged values of $X_t$
   by expanding
   $\varepsilon_t = (1 - 0.9L + 0.2L^2)^{-1}(1 - 1.0L + 0.5L^2)X_t$ in powers
   of L.

8. For a second-order AR process, show that the (theoretical) partial
   autocorrelation coefficient of order 2 is given by
   $(\rho_2 - \rho_1^2)/(1 - \rho_1^2)$.

9. Suppose that the correlogram of a time series consisting of 100
   observations has $r_1 = 0.50$, $r_2 = 0.63$, $r_3 = -0.10$, $r_4 = 0.08$,
   $r_5 = -0.17$, $r_6 = 0.13$, $r_7 = 0.09$, $r_8 = -0.05$, $r_9 = 0.12$,
   $r_{10} = -0.05$. Suggest an ARMA model that would be appropriate. [Hint:
   The SE of each of the correlations $\approx 1/\sqrt{N} = 0.10$. Values
   greater than $2/\sqrt{N}$ in absolute value are significant. Thus only the
   first 2 are significant. Hence an MA(2) process is appropriate.]

10. Table 13.7 gives data on the monthly short-term interest rate (MR1) and
    the long-term interest rate (MR240) for the period February 1950-December
    1982. For each of the series estimate a regression on a time trend and
    seasonal dummies. Compute the $R^2$. Using the discussion of the different
    $R^2$ measures suggested in Section 13.7, check whether these $R^2$'s are
    good or not.

11. In Exercise 10 compute the acf and plot the correlogram to determine 
whether the data are trend stationary or difference stationary. If they are 
difference stationary, determine the appropriate order of autoregression for 
the first differences of the two series using the methods in Table 13.1. 

12. The seasonal effect in the two series in Table 13.7 can be taken into
    account by using seasonal dummies or by seasonal differencing as in the
    Box-Jenkins approach. Which of these two methods is appropriate for these
    data sets?

13. Repeat Exercise 10 with the quarterly data on consumption and income in 
Table 13.4. 

14. Repeat Exercise 11 with the data in Table 13.4. 

15. Repeat Exercise 12 with the data in Table 13.4. 

16. Illustrate the use of the AIC and BIC criteria in the choice between
    different ARMA models using one of these data sets.

Data Sets


In the following pages we present several time series. Table 13.2 presents data 
on the famous Beveridge price index. The data in Tables 13.3 to 13.7 were 
kindly provided by David N. DeJong. The data in Table 13.8 were provided by 
Harry Vroomen. In addition, time-series data are also provided in Table 4.10. 
All these data sets can be used to experiment with the techniques presented in 
the chapter. 


Table 13.2 Beveridge Trend-Free Wheat Price Index 


Year 

AIN 

TFI 

Year 

AIN 

TFI 

Year 

AIN 

TFI 

1500 

17 

106 

1532 

20 

97 

1564 

32 

78 

1501 

19 

118 

1533 

18 

90 

1565 

47 

112 

1502 

20 

124 

1534 

16 

76 

1566 

42 

100 

1503 

15 

94 

1535 

22 

102 

1567 

37 

86 

1504 

13 

82 

1536 

22 

100 

1568 

34 

77 

1505 

14 

88 

1537 

16 

73 

1569 

36 

80 

1506 

14 

87 

1538 

19 

86 

1570 

43 

93 

1507 

14 

88 

1539 

17 

74 

1571 

55 

112 

1508 

14 

88 

1540 

17 

74 

1572 

64 

131 

1509 

11 

68 

1541 

19 

76 

1573 

79 

158 

1510 

16 

98 

1542 

20 

80 

1574 

59 

113 

1511 

19 

115 

1543 

24 

96 

1575 

47 

89 

1512 

23 

135 

1544 

28 

112 

1576 

48 

87 

1513 

18 

104 

1545 

36 

144 

1577 

49 

87 

1514 

17 

96 

1546 

20 

80 

1578 

45 

79 

1515 

20 

110 

1547 

14 

54 

1579 

53 

90 

1516 

20 

107 

1548 

18 

69 

1580 

55 

90 

1517 

18 

97 

1549 

27 

100 

1581 

55 

87 

1518 

14 

75 

1550 

29 

103 

1582 

54 

83 

1519 

16 

86 

1551 

36 

129 

1583 

56 

85 

1520 

21 

111 

1552 

29 

100 

1584 

52 

76 

1521 

24 

125 

1553 

27 

90 

1585 

76 

110 

1522 

15 

78 

1554 

30 

100 

1586 

113 

161 

1523 

16 

86 

1555 

38 

123 

1587 

68 

97 

1524 

20 

102 

1556 

50 

156 

1588 

59 

84 

1525 

14 

71 

1557 

24 

71 

1589 

74 

106 

1526 

16 

81 

1558 

25 

71 

1590 

78 

111 

1527 

25.5 

129 

1559 

30 

81 

1591 

69 

97 

1528 

25.8 

130 

1560 

31 

84 

1592 

78 

108 

1529 

26 

129 

1561 

37 

97 

1593 

73 

100 

1530 

26 

125 

1562 

41 

105 

1594 

88 

119 

1531 

29 

139 

1563 

36 

90 

1595 

98 

131 


(cant'd) 
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Table 13.2 ( Cont .) 


Year 

AIN 

TFI 

Year 

AIN 

TFI 

Year 

AIN 

TFI 

1596 

109 

143 

1639 

98 

93 

1682 

74 

84 

1597 

106 

138 

1640 

103 

99 

1683 

79 

86 

1598 

87 

112 

1641 

101 

99 

1684 

95 

101 

1599 

77 

99 

1642 

110 

107 

1685 

70 

74 

1600 

77 

97 

1643 

109 

106 

1686 

72 

75 

1601 

63 

80 

1644 

98 

96 

1687 

63 

66 

1602 

70 

90 

1645 

84 

82 

1688 

60 

62 

1603 

70 

90 

1646 

90 

88 

1689 

74 

76 

1604 

63 

80 

1647 

120 

116 

1690 

75 

79 

1605 

61 

77 

1648 

124 

122 

1691 

91 

97 

1606 

66 

81 

1649 

136 

134 

1692 

126 

134 

1607 

78 

98 

1650 

120 

119 

1693 

161 

169 

1608 

93 

115 

1651 

135 

136 

1694 

109 

111 

1609 

97 

94 

1652 

100 

102 

1695 

108 

109 

1610 

77 

93 

1653 

70 

72 

1696 

110 

111 

1611 

83 

100 

1654 

60 

63 

1697 

130 

128 

1612 

81 

99 

1655 

72 

76 

1698 

166 

163 

1613 

82 

100 

1656 

70 

75 

1699 

143 

137 

1614 

78 

94 

1657 

71 

77 

1700 

103 

99 

1615 

75 

88 

1658 

94 

103 

1701 

89 

85 

1616 

80 

92 

1659 

95 

104 

1702 

76 

72 

1617 

87 

100 

1660 

110 

120 

1703 

93 

88 

1618 

72 

82 

1661 

154 

167 

1704 

82 

77 

1619 

65 

73 

1662 

116 

126 

1705 

71 

66 

1620 

74 

81 

1663 

99 

108 

1706 

69 

64 

1621 

91 

99 

1664 

82 

91 

1707 

75 

69 

1622 

115 

124 

1665 

76 

85 

1708 

134 

125 

1623 

99 

106 

1666 

64 

73 

1709 

183 

175 

1624 

99 

106 

1667 

63 

74 

1710 

113 

108 

1625 

115 

121 

1668 

68 

80 

1711 

108 

103 

1626 

101 

105 

1669 

64 

74 

1712 

121 

115 

1627 

90 

84 

1670 

67 

78 

1713 

139 

134 

1628 

95 

97 

1671 

71 

83 

1714 

109 

108 

1629 

108 

109 

1672 

72 

84 

1715 

90 

90 

1630 

147 

148 

1673 

89 

106 

1716 

88 

89 

1631 

112 

114 

1674 

114 

134 

1717 

88 

89 

1632 

108 

108 

1675 

102 

122 

1718 

93 

94 

1633 

99 

97 

1676 

85 

102 

1719 

106 

107 

1634 

96 

92 

1677 

88 

107 

1720 

89 

89 

1635 

102 

97 

1678 

97 

115 

1721 

79 

79 

1636 

105 

98 

1679 

94 

113 

1722 

91 

91 

1637 

114 

105 

1680 

88 

104 

1723 

96 

94 

1638 

103 

97 

1681 

79 

92 

1724 

111 

110 
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Table 13.2 ( Cont .) 


Year 

AIN 

TFI 

Year 

AIN 

TFI 

Year 

AIN 

TFI 

1725 

112 

111 

1767 

148 

113 

1809 

208 

94 

1726 

104 

103 

1768 

142 

108 

1810 

226 

104 

1727 

94 

94 

1769 

143 

108 

1811 

302 

140 

1728 

98 

101 

1770 

176 

131 

1812 

261 

121 

1729 

88 

90 

1771 

184 

136 

1813 

207 

96 

1730 

94 

96 

1772 

164 

119 

1814 

209 

96 

1731 

81 

80 

1773 

146 

106 

1815 

280 

130 

1732 

77 

76 

1774 

147 

105 

1816 

381 

178 

1733 

84 

84 

1775 

124 

88 

1817 

266 

126 

1734 

92 

91 

1776 

119 

84 

1818 

197 

94 

1735 

96 

94 

Mil 

135 

94 

1819 

Ml 

86 

1736 

102 

101 

1778 

125 

87 

1820 

170 

84 

1737 

95 

93 

1779 

116 

79 

1821 

152 

76 

1738 

98 

91 

1780 

132 

87 

1822 

156 

77 

1739 

125 

122 

1781 

133 

88 

1823 

141 

71 

1740 

162 

159 

1782 

144 

94 

1824 

142 

71 

1741 

113 

110 

1783 

145 

94 

1825 

137 

69 

1742 

94 

90 

1784 

146 

92 

1826 

161 

82 

1743 

85 

81 

1785 

138 

85 

1827 

189 

93 

1744 

89 

84 

1786 

139 

84 

1828 

226 

114 

1745 

109 

102 

1787 

154 

93 

1829 

194 

103 

1746 

110 

102 

1788 

181 

108 

1830 

217 

110 

1747 

109

100 

1789 

185 

108 

1831 

199 

105 

1748 

120 

109 

1790 

151 

86 

1832 

151 

82 

1749 

116 

104 

1791 

139 

78 

1833 

144 

80 

1750 

101 

90 

1792 

157 

87 

1834 

138 

78 

1751 

113 

99 

1793 

155 

85 

1835 

145 

82 

1752 

109 

95 

1794 

191 

103 

1836 

156 

88 

1753 

105 

90 

1795 

248 

130 

1837 

184 

102 

1754 

94 

80 

1796 

185 

95 

1838 

216 

117 

1755 

102 

85 

1797 

168 

84 

1839 

204 

107 

1756 

141 

117 

1798 

176 

87 

1840 

186 

95 

1757 

135 

112 

1799 

243 

120 

1841 

197 

101 

1758 

118 

95 

1800 

289 

139 

1842 

183 

92 

1759 

115 

91 

1801 

251 

117 

1843 

175 

88 

1760 

111 

88 

1802 

232 

105 

1844 

183 

92 

1761 

127 

100 

1803 

207 

94 

1845 

230 

115 

1762 

124 

97 

1804 

276 

125 

1846 

278 

139 

1763 

113 

88 

1805 

250 

114 

1847 

179 

90 

1764 

122 

95 

1806 

216 

98 

1848 

161 

80 

1765 

130 

101 

1807 

205 

93 

1849 

150 

74 

1766 

137 

106 

1808 

206 

94 

1850 

159 

78 


(cont'd)
Table 13.2 (Cont.)


Year   AIN   TFI     Year   AIN   TFI     Year   AIN   TFI
1851   180    86     1858   180    81     1864   179    81
1852   223   105     1859   215    97     1865   210    94
1853   294   138     1860   258   116     1866   268   119
1854   300   141     1861   236   107     1867   267   118
1855   297   138     1862   202    92     1868   208    93
1856   232   107     1863   174    79     1869   224   102
1857   179    82

Note: The actual index number (AIN) is the mean of the index numbers for the countries and
places included, each of these being reduced to the basis: mean of prices for 1700-1745 = 100.

The next column, TFI (trend-free index), shows the actual index number for each year as
a percentage of the mean of the actual index numbers for the 31 years of which that year is
the center.
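The trend-free index described in the note can be sketched as follows. This is only an illustration under stated assumptions: the function name `trend_free_index` is ours, the toy series is made up (it is not Beveridge's data), and the treatment of years near the ends of the sample, which lack a full 31-year window, is an assumption (they are simply left undefined here).

```python
def trend_free_index(ain, window=31):
    """Express each value as a percentage of its centered moving average.

    Years too close to either end to have a full window are returned as None
    (an assumption; the note does not say how the end years were handled).
    """
    half = window // 2
    tfi = []
    for i in range(len(ain)):
        if i < half or i + half >= len(ain):
            tfi.append(None)
            continue
        centered_mean = sum(ain[i - half:i + half + 1]) / window
        tfi.append(100.0 * ain[i] / centered_mean)
    return tfi


# Toy series (hypothetical): a flat level of 100 with one spike of 130.
toy = [100.0, 100.0, 130.0, 100.0, 100.0]
result = trend_free_index(toy, window=3)
```

With the 3-year window used in the toy example, the spike year is measured against the mean of itself and its two neighbors, exactly as each year in the table is measured against the 31 years centered on it.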

Source: W. H. Beveridge, “Weather and Harvest Cycles,” Economic Journal, Vol. 31, 1921, 
pp. 429-452. 
1899 0.0 0 0 6.1 26,911 6.5 23.6 25.0 0 0.00 6.1 2.49 0.00 6.29

1900 0.0 0 0 6.3 27,909 5.0 24.7 25.0 487 19.48 6.6 2.49 3.30 6.15

1901 0.0 0 0 7.1 28,922 4.0 24.5 25.0 511 20.44 7.5 2.44 3.25 7.84 

1902 0.0 0 0 7.9 29,800 3.7 25.4 26.0 537 20.65 8.2 2.31 3.30 8.42 

1903 0.0 0 0 8.1 30,506 3.9 25.7 27.0 548 20.30 8.7 2.29 3.45 7.21 





1904 0.0 0 0 7.8 30,771 5.4 26.0 27.0 538 19.93 9.2 2.14 3.60 7.05 

1905 0.0 0 0 9.2 31,976 4.3 26.5 27.0 561 20.78 10.2 2.13 3.50 8.99 

1906 0.0 0 0 9.8 33,749 1.7 27.2 28.0 577 20.61 11.1 2.28 3.55 9.64 

1907 0.0 0 0 10.0 34,371 2.8 28.3 29.0 598 20.62 11.6 2.28 3.80 7.84 

1908 0.0 0 0 8.5 33,246 8.0 28.1 28.0 548 19.57 11.4 2.06 3.95 7.78 
75 2 1.427 4.134 80 3 1.550 4.469

75 3 1.429 4.060 80 4 1.557 4.502

75 4 1.431 4.082 81 1 1.565 4.548

76 1 1.456 4.139 81 2 1.572 4.530

Note: Real per capita consumption on nondurables and real per capita disposable income.










Table 13.5 Quarterly Data on GNP and Money


Year 

Quarter 

GNP 

QM2 

Year 

Quarter 

GNP 

QM2 

59 

1 


287.83 


2 

986.3 

594.07 

59 

2 


292.17 


3 


606.30 


3 

489.0 

296.10 

70 

4 

1009.0 

622.73 

59 

4 

495.0 

297.20 

71 

1 

1049.3 

642.67 


1 

506.9 

298.73 

71 

2 

1068.9 

668.43 


2 

506.3 

301.10 

71 

3 

1086.6 

687.80 


3 

508.0 

306.47 

71 

4 

1105.8 

706.67 


4 

504.8 

310.97 

72 

1 

1142.4 

727.87 

61 

1 

508.2 

316.33 

72 

2 

1171.7 

746.97 

61 

2 

519.2 

322.17 

72 

3 

1196.1 

771.70 

61 

3 

528.2 

327.57 

72 

4 

1233.5 

797.10 

61 

4 

542.6 

333.40 

73 

1 

1283.5 

816.73 

62 

1 

554.2 

340.30 

73 

2 

1307.6 

830.50 

62 

2 

562.7 

347.47 

73 

3 

1337.7 

843.13 

62 

3 

568.9 

352.87 

73 

4 

1376.7 

854.63 

62 

4 

574.3 

359.97 

74 

1 

1387.7 

870.80 

63 

1 

582.0 

367.97 

74 

2 

1423.8 

881.83 

63 

2 

590.7 

375.97 

74 

3 

1451.6 

891.50 

63 

3 

601.8 

383.60 

74 

4 

1473.8 

904.63 

63 

4 

612.4 

391.10 

75 

1 

1479.8 

921.27 

64 

1 

625.3 

397.60 

75 

2 

1516.7 

955.07 

64 

2 

634.0 

404.33 

75 

3 

1578.5 

989.80 

64 

3 

642.8 

413.47 

75 

4 

1621.8 

1014.17 

64 

4 

648.8 

422.10 

76 

1 

1672.0 

1046.23 

65 

1 

668.8 

430.47 

76 

2 

1698.6 

1078.87 

65 

2 

681.7 

437.57 

76 

3 

1729.0 

1108.33 

65 

3 

696.4 

445.97 

76 

4 

1772.5 

1149.30 

65 

4 

717.2 

456.07 

77 

1 

1839.1 

1188.27 

66 

1 

738.5 

464.67 

77 

2 

1893.9 

1221.07 

66 

2 

750.0 

470.17 

77 

3 

1950.4 

1250.30 

66 

3 

760.6 

473.00 

77 

4 

1988.6 

1278.00 

66 

4 

774.9 

477.87 

78 

1 

2032.4 

1302.00 


1 

780.7 

485.43 

78 

2 

2129.6 

1325.40 


2 

788.6 

497.13 

78 

3 

2190.5 

1351.23 


3 

805.7 

510.83 

78 

4 

2271.9 

1379.83 


4 

823.3 

521.37 

79 

1 

2340.6 

1403.03 

68 

1 

841.2 

530.23 

79 

2 

2374.6 

1435.63 

68 

2 

867.2 

539.17 

79 

3 

2444.1 

1470.60 

68 

3 

884.9 

549.90 

79 

4 

2496.3 

1491.80 

68 

4 

900.3 

562.20 

80

1 

2571.7 

1518.47 

69 

1 

921.2 

571.67 

80

2 

2564.8 

1535.30 

69 

2 

937.4 

577.07 

80

3 

2637.3 

1588.70 

69 

3 

955.3 

580.93 

80

4 

2730.6 

1626.07 

69 

4 

962.0 

586.80 

81 

1 

2853.0 

1654.40 


1 

972.0 

589.67 






Note: Nominal GNP and the nominal money supply measured by M2.


Table 13.6 Real Stock Price (S&P) and Real Dividends (S&P), 1871-1985


Year 

Real 

Price 

Real 

Dividends 

Year 

Real 

Price 

Real 

Dividends 

1871 

10.05 

0.585 

1900 

22.76 

1.108 

1872 

10.63 

0.656 

1901 

27.19 

1.212 

1873 

11.26 

0.720 

1902 

30.64 

1.180 

1874 

10.76 

0.759 

1903 

30.32 

1.237 

1875 

11.18 

0.744 

1904 

24.56 

1.060 

1876 

11.86 

0.793 

1905 

30.77 

1.169 

1877 

9.67 

0.516 

1906 

34.03 

1.359 

1878 

9.79 

0.543 

1907 

31.14 

1.423 

1879 

11.26 

0.636 

1908 

22.68 

1.354 

1880 

14.39 

0.731 

1909 

30.40 

1.370 

1881 

17.99 

0.916 

1910 

31.60 

1.422 

1882 

16.91 

0.912 

1911 

29.90 

1.551 

1883 

17.55 

1.011 

1912 

29.04 

1.480 

1884 

17.21 

1.034 

1913 

27.43 

1.451 

1885 

15.25 

0.877 

1914 

25.21 

1.309 

1886 

19.05 

0.789 

1915 

22.53 

1.297 

1887 

20.29 

0.891 

1916 

24.36 

1.331 

1888 

18.76 

0.802 

1917 

18.44 

1.188 

1889 

19.19 

0.789 

1918 

11.55 

0.891 

1890 

20.15 

0.826 

1919 

11.63 

0.779 

1891 

18.40 

0.853 

1920 

11.18 

0.687 

1892 

22.31 

0.977 

1921 

12.32 

0.959 

1893 

22.17 

0.989 

1922 

15.60 

1.038 

1894 

19.03 

0.931 

1923 

16.82 

1.049 

1895 

18.48 

0.818 

1924 

17.11 

1.121 

1896 

19.41 

0.837 

1925 

19.78 

1.158 

1897 

19.18 

0.825 

1926 

23.60 

1.369 

1898 

21.22 

0.852 

1927 

26.80 

1.601 

1899 

24.62 

0.851 

1928 

35.27 

1.745 
Table 13.6 (Cont.)


Year 

Real 

Price 

Real 

Dividends 

Year 

Real 

Price 

Real 

Dividends 

1929 

49.52 


1958 

43.61 


1930 

45.14 


1959 

58.67 


1931 

40.15 



61.28 


1932 

23.85 


1961 

62.73 

2.138 

1933 

22.44 


1962 

72.71 

2.247 

1934 

28.18 


1963 


2.413 

1935 

22.70 


1964 



1936 

33.00 


1965 


2.816 

1937 

39.62 


1966 

94.65 

2.876 

1938 

26.99 


1967 

84.37 


1939 

31.41 


1968 


2.995 

1940 

29.93 


1969 

97.79 

2.967 

1941 

25.24 



82.63 

2.844 

1942 

17.97 


1971 

83.62 

2.695 

1943 

19.15 


1972 

88.82 

2.645 

1944 

22.19 


1973 



1945 

24.84 


1974 


2.249 

1946 

32.53 

1.164 

1975 

42.24 


1947 

20.78 


1976 

53.99 

2.213 

1948 

17.89 


1977 

55.18 


1949 

18.82 


1978 


2.422 

1950 

21.75 


1979 

44.53 

2.398 

1951 

23.26 




2.292 

1952 

26.97 


1981 

46.69 


1953 

30.02 


1982 

39.32 

2.295 

1954 

28.93 


1983 

48.11 

2.339 

1955 

40.73 


1984 


2.427 

1956 

49.72 


1985 

55.45 


1957 

49.01 






Note: Stock prices and dividends corresponding to the Standard and Poor’s 500 index.















Table 13.7 Monthly Data on Interest Rates 
Note: Short- and long-term interest rates as measured by the 1-month Treasury bill rate and
the monthly yields to maturity of 20-year Treasury bonds.

Source: David N. DeJong, “Co-integration and Trend Stationary in Macroeconomic Time Series:
Evidence from the Likelihood Function,” Working Paper 89-05, University of Iowa, February 1989.
Table 13.8 Acreage and Yields on Corn, Soybeans, and Wheat

Acres Yields 


Year 

Corn 

Soy 

Wheat 

Corn 

Soy 

Wheat 

1930 

103,915 

3,072 

67,559 

20.5 

13.0 

14.2 

1931 

109,364 

3,835 

66,463 

24.5 

15.1 

16.3 

1932 

113,024 

3,704 

66,281 

26.5 

15.1 

13.1 

1933 

109,830 

3,537 

69,009 

22.8 

12.9 

11.2 

1934 

100,563 

5,769 

64,064 

18.7 

14.9 

12.1 

1935 

99,974 

6,966 

69,611 

24.2 

16.8 

12.2 

1936 

101,959 

6,127 

73,970 

18.6 

14.3 

12.8 

1937 

97,174 

6,332 

80,814 

28.9 

17.9 

13.6 

1938 

94,473 

7,318 

78,981 

27.8 

20.4 

13.3 

1939 

91,639 

9,565 

62,802 

29.9 

20.9 

14.1 

1940 

88,692 

10,487 

61,820 

28.9 

16.2 

15.3 

1941 

86,837 

10,068 

62,707 

31.2 

18.2 

16.8 

1942 

88,818 

13,696 

53,000 

35.4 

19.0 

19.5 

1943 

94,341 

14,191 

55,984 

32.6 

18.3 

16.4 

1944 

95,475 

13,118 

66,190 

33.0 

18.8 

17.7 

1945 

89,261 

13,056 

69,192 

33.1 

18.0 

17.0 

1946 

88,898 

11,706 

71,578 

37.2 

20.5 

17.2 

1947 

85,032 

13,052 

78,314 

28.6 

16.3 

18.2 

1948

85,522 

11,987 

78,345 

43.0 

21.3 

17.9 

1949 

86,738 

11,872 

83,905 

38.2 

22.3 

14.5 

1950 

82,859 

15,048 

71,287 

38.2 

21.7 

16.5 

1951 

83,275 

15,176 

78,524 

36.9 

20.8 

16.0 

1952 

82,230 

15,958 

78,645 

41.8 

20.7 

18.4 

1953 

81,574 

16,394 

78,931 

40.7 

18.2 

17.3 

1954 

82,185 

18,541 

62,539 

39.4 

20.0 

18.1 

1955 

80,932 

19,674 

58,246 

42.0 

20.1 

19.8 

1956 

77,828 

21,700 

60,655 

47.4 

21.8 

20.2 

1957 

73,180 

21,038 

49,834 

48.3 

23.2 

21.8 

1958 

73,351 

25,108 

56,017 

52.8 

24.2 

27.5 

1959 

82,742 

23,349 

56,706 

53.1 

23.5 

21.6 

1960 

81,425 

24,440 

54,906 

54.7 

23.5 

26.1 

1961 

65,919 

27,787 

55,707 

62.4 

25.1 

23.9 

1962 

65,017 

28,418 

49,274 

64.7 

24.2 

25.0 

1963 

68,771 

29,462 

53,364 

67.9 

24.4 

25.2 

1964 

65,823 

31,605 

55,672 

62.9 

22.8 

25.8 

1965 

65,171 

35,277 

57,361 

74.1 

24.5 

26.5 

1966 

66,347 

37,294 

54,105 

73.1 

25.4 

26.3 

1967 

71,156 

40,819 

67,264 

80.1 

24.5 

25.8 

1968 

65,126 

42,265 

61,860 

79.5 

26.7 

28.4 

1969 

64,264 

42,534 

53,450 

85.9 

27.4 

30.6 

1970 

66,836 

43,082 

48,739 

72.4 

26.7 

31.0 


( cont’d ) 
Table 13.8 (Cont.)


Year 


Acres 



Yields 


Corn 

Soy 

Wheat 

Corn 

Soy 

Wheat 

1971 

74,179 

43,476 

53,822 

88.1 

27.5 

33.9 

1972 

67,126 

46,866 

54,913 

97.0 

27.8 

32.7 

1973 

72,253 

56,549 

59,254 

91.3 

27.8 

31.6 

1974 

77,935 

52,479 

71,044 

71.9 

23.7 

27.3 

1975 

78,719 

54,590 

74,900 

86.4 

28.9 

30.6 

1976 

84,588 

50,269 

80,395 

88.0 

26.1 

30.3 

1977 

84,328 

58,978 

75,410 

90.8 

30.6 

30.7 

1978 

81,675 

64,708 

65,989 

101.0 

29.4 

31.4 

1979 

81,394 

71,411 

71,424 

109.5 

32.1 

34.2 

1980 

84,043 

69,930 

80,788 

91.0 

26.5 

33.5 

1981 

84,097 

67,543 

88,251 

108.9 

30.1 

34.5 

1982 

81,857 

70,884 

86,232 

113.2 

31.5 

35.5 

1983 

60,217 

63,779

76,419 

81.1 

26.2 

39.4 

1984 

80,543 

67,755 

79,213 

106.7 

28.1 

38.8 

1985 

83,348 

63,130 

75,575 

118.0 

34.1 

37.5 

1986 

76,646 

61,835 

72,007 

119.3 

33.1 

34.8 
14 VECTOR AUTOREGRESSIONS, UNIT ROOTS, AND COINTEGRATION



14.1 Introduction 


In Chapter 13 we discussed time-series analysis as it stood about two decades
ago. The main emphasis was on transforming the data to achieve stationarity
and then estimating ARMA models. The differencing operation used to achieve
stationarity involves a loss of potential information about long-run movements.

In the present chapter we discuss three major developments during the last
two decades, mainly to handle nonstationary time series: vector autoregressions
(VARs), unit roots, and cointegration. The Box-Jenkins method of differencing
the time series after a visual inspection of the correlogram has been
formalized in the tests for unit roots. The VAR model is a very useful starting
point in the analysis of the interrelationships between the different time series.
The literature on unit roots studies nonstationary time series which are stationary
in first differences. The theory of cointegration explains how to study the
interrelationships between the long-term trends in the variables, trends that are
differenced away in the Box-Jenkins methods. In the following sections we
discuss the VAR models, unit roots, and cointegration. Unlike previous chapters,
in this chapter we use matrix notation at some places.


14.2 Vector Autoregressions 


In previous sections we discussed the analysis of a single time series. When 
we have several time series, we need to take into account the interdependence 
between them. One way of doing this is to estimate a simultaneous equations 
model as discussed in Chapter 9 but with lags in all the variables. Such a model 
is called a dynamic simultaneous equations model. However, this formulation 
involves two steps: first, we have to classify the variables into two categories,
endogenous and exogenous, and second, we have to impose some constraints
on the parameters to achieve identification. Sims 1 argues that both these steps
involve many arbitrary decisions and suggests, as an alternative, the vector
autoregression (VAR) approach. This is just a multiple time-series generalization
of the AR model. 2 The VAR model is easy to estimate because we can use the
OLS method.

1 C. A. Sims, “Macroeconomics and Reality,” Econometrica, Vol. 48, January 1980, pp. 1-48.

2 The multiple time-series analog of the ARMA model is the VARMA model. But given the
complexities in the estimation of MA models discussed earlier, it is clear that estimation of
the VARMA model is more complicated still and goes beyond our purpose here. The VARMA
model was introduced by G. C. Tiao and G. E. P. Box, “Modelling Multiple Time Series with
Applications,” Journal of the American Statistical Association, Vol. 76, 1981, pp. 802-816.

Consider two economic time series $y_{1t}$ and $y_{2t}$. If we are considering the
relationship between money growth and the inflation rate, then $y_{1t}$ and $y_{2t}$ are,
respectively, the rates of growth of the money supply and of the GNP deflator. The VAR
model with only one lag in each variable (suppressing constants) would be

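$$
\begin{aligned}
y_{1t} &= \alpha_{11} y_{1,t-1} + \alpha_{12} y_{2,t-1} + \varepsilon_{1t} \\
y_{2t} &= \alpha_{21} y_{1,t-1} + \alpha_{22} y_{2,t-1} + \varepsilon_{2t}
\end{aligned}
\tag{14.1}
$$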

We can also add lagged values of some “exogenous” variables $z_t$, but then we
again face the problem of classifying variables as endogenous and exogenous.
In practice there would often be more than two endogenous variables and often
more than one lag. In this case with $k$ endogenous variables and $p$ lags, we can
write the VAR model in matrix notation as

$$ y_t = A_1 y_{t-1} + \cdots + A_p y_{t-p} + \varepsilon_t $$

where $y_t$ and its lagged values, and $\varepsilon_t$, are $k \times 1$ vectors and
$A_1, \ldots, A_p$ are $k \times k$ matrices of constants to be estimated.

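To fix ideas, let’s go back to the two-equation system (14.1). We can write
the system in terms of the lag operator $L$ as

$$
\begin{bmatrix} 1 - \alpha_{11}L & -\alpha_{12}L \\ -\alpha_{21}L & 1 - \alpha_{22}L \end{bmatrix}
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
=
\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}
$$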

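This gives the solution

$$
\begin{bmatrix} y_{1t} \\ y_{2t} \end{bmatrix}
= \begin{bmatrix} 1 - \alpha_{11}L & -\alpha_{12}L \\ -\alpha_{21}L & 1 - \alpha_{22}L \end{bmatrix}^{-1}
\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}
= \frac{1}{\Delta}
\begin{bmatrix} 1 - \alpha_{22}L & \alpha_{12}L \\ \alpha_{21}L & 1 - \alpha_{11}L \end{bmatrix}
\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}
\tag{14.2}
$$

where

$$
\begin{aligned}
\Delta &= (1 - \alpha_{11}L)(1 - \alpha_{22}L) - (\alpha_{12}L)(\alpha_{21}L) \\
       &= 1 - (\alpha_{11} + \alpha_{22})L + (\alpha_{11}\alpha_{22} - \alpha_{12}\alpha_{21})L^2 \\
       &= (1 - \lambda_1 L)(1 - \lambda_2 L) \quad \text{say}
\end{aligned}
$$

where $\lambda_1$ and $\lambda_2$ are the roots of the equation

$$ \lambda^2 - (\alpha_{11} + \alpha_{22})\lambda + (\alpha_{11}\alpha_{22} - \alpha_{12}\alpha_{21}) = 0 $$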

In order that we have a convergent expansion for $y_{1t}$ and $y_{2t}$ in terms of
$\varepsilon_{1t}$ and $\varepsilon_{2t}$, we should have $|\lambda_1| < 1$ and
$|\lambda_2| < 1$. Following the results in the appendix to Chapter 7 we note that this is
the same condition that the roots of $|A - \lambda I| = 0$ are less than one in absolute
value, where $A$ is the matrix of the lag coefficients

$$ A = \begin{bmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{bmatrix} $$
Once the condition for stability is satisfied, we can express $y_{1t}$ (and $y_{2t}$) as
functions of the current and lagged values of $\varepsilon_{1t}$ and $\varepsilon_{2t}$.
These are known as the impulse response functions. They show the current and lagged effects
over time of changes in $\varepsilon_{1t}$ and $\varepsilon_{2t}$ on $y_{1t}$ and $y_{2t}$.
For instance, in the simple model of money ($y_{1t}$) and prices ($y_{2t}$), we have, from (14.2),

$$ y_{1t} = \Delta^{-1}\left[(1 - \alpha_{22}L)\varepsilon_{1t} + \alpha_{12}L\,\varepsilon_{2t}\right] $$

and expanding $\Delta^{-1}$ in powers of $L$ and gathering the expressions with the same
powers of $L$, we get

$$
\begin{aligned}
y_{1t} = {} & \varepsilon_{1t} + \alpha_{11}\varepsilon_{1,t-1} + (\alpha_{11}^2 + \alpha_{12}\alpha_{21})\varepsilon_{1,t-2} + \cdots \\
            & + \alpha_{12}\varepsilon_{2,t-1} + \alpha_{12}(\alpha_{11} + \alpha_{22})\varepsilon_{2,t-2} + \cdots
\end{aligned}
$$

with a similar expression for $y_{2t}$. Thus a price shock in period $t$ has no effect
on money until period $(t + 1)$, and vice versa for the effect of a money shock
on prices. Of course, these lags are a consequence of the one-period lags in the
VAR model (14.1).
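The coefficients in the expansion above are the elements of successive powers of the lag-coefficient matrix, which suggests a simple way to compute impulse responses numerically. The following is a minimal sketch under that observation; the helper names and the numerical matrix are hypothetical and are not taken from the text.

```python
def mat_mul(x, y):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[sum(x[i][m] * y[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]


def impulse_responses(a, horizons):
    """Return [A^0, A^1, ..., A^horizons].

    Element [i][j] of A^h is the effect on variable i, h periods later,
    of a unit shock to the error of equation j.
    """
    power = [[1.0, 0.0], [0.0, 1.0]]  # A^0 = identity: the contemporaneous effect
    responses = [power]
    for _ in range(horizons):
        power = mat_mul(power, a)
        responses.append(power)
    return responses


# Hypothetical lag coefficients (a11, a12; a21, a22).
A = [[0.5, 0.2], [0.1, 0.4]]
irf = impulse_responses(A, horizons=2)

# Effect of a price shock on money: zero on impact, a12 after one period,
# a12 * (a11 + a22) after two periods, matching the expansion in the text.
money_response_to_price_shock = [irf[h][0][1] for h in range(3)]
```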

14.3 Problems with VAR Models in Practice

We have considered only a simple model with two variables and only one lag 
for each. In practice, since we are not considering any moving-average errors, 
the autoregressions would probably have to have more lags to be useful for 
prediction. Otherwise, univariate ARMA models would do better. Suppose 
that we consider say six lags for each variable and we have a small system with 
four variables. Then each equation would have 24 parameters to be estimated 
and we thus have 96 parameters to estimate overall. This overparameterization 
is one of the major problems with VAR models. The unrestricted VAR models
have not been found very useful for forecasting, and other extensions using
some restrictions on the parameters of the VAR models have been suggested.
One such model that has been found particularly useful in prediction is the
Bayesian vector autoregression (BVAR). 3 In BVAR we assign some prior
distributions for the coefficients in the vector autoregressions. (See Section 2.5 of
Chapter 2 for prior distributions.) In each equation, the coefficient of the first own
lag has a prior mean 1, all others have prior means 0, with the
variance of the prior decreasing as the lag length increases. For instance, with
two variables $y_{1t}$ and $y_{2t}$, and 4 lags for each, the first equation will be

$$ y_{1t} = \alpha_1 y_{1,t-1} + \sum_{j=2}^{4} \alpha_j y_{1,t-j} + \sum_{j=1}^{4} \beta_j y_{2,t-j} + \varepsilon_{1t} $$

The prior for $\alpha_1$ will have mean 1 and variance $\lambda$ with $\lambda < 1$. The
priors for $\alpha_2, \alpha_3, \alpha_4$ will have means 0 and variances $\lambda^2,
\lambda^3, \lambda^4$, respectively. The priors for $\beta_1, \beta_2, \beta_3, \beta_4$
will have means 0 and variances $\lambda, \lambda^2, \lambda^3, \lambda^4$, respectively.
Since $\lambda < 1$, as the lag length increases, the variance decreases; that is, we are
more and more sure that the coefficient is zero. With the second equation, the priors are
similar. The coefficient of $y_{2,t-1}$ will have a prior with mean 1. All other
coefficients will have prior mean 0, with the coefficients of the distant lags having priors
more tightly concentrated around zero.

3 The RATS computer program is the one commonly used for the estimation of VAR and BVAR
models.
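The parameter count and the prior scheme just described can be written out in a few lines. This is only a sketch of the illustrative prior in the text, not of any particular program's implementation; the function names and the value chosen for the variance parameter (0.5) are hypothetical.

```python
def var_parameter_count(n_vars, n_lags):
    """Lag coefficients per equation and in the whole unrestricted VAR."""
    per_equation = n_vars * n_lags
    return per_equation, n_vars * per_equation


def prior_for_equation(own_var, n_vars, n_lags, lam):
    """Prior (mean, variance) for each coefficient, keyed by (variable, lag).

    The first own lag has mean 1; every other coefficient has mean 0.
    The variance at lag j is lam ** j, so it shrinks with the lag when lam < 1.
    """
    prior = {}
    for var in range(n_vars):
        for lag in range(1, n_lags + 1):
            mean = 1.0 if (var == own_var and lag == 1) else 0.0
            prior[(var, lag)] = (mean, lam ** lag)
    return prior


# The example in the text: four variables with six lags each.
per_eq, total = var_parameter_count(4, 6)

# First equation of a two-variable system with four lags (lam is hypothetical).
first_eq_prior = prior_for_equation(own_var=0, n_vars=2, n_lags=4, lam=0.5)
```

Here `per_eq` and `total` reproduce the 24 and 96 coefficients mentioned above, and `first_eq_prior[(1, 4)]` gives the tightly concentrated prior on the most distant lag of the other variable.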

The example of priors above is just meant to illustrate the flavor of the BVAR
method. Other priors can be incorporated using the RATS program. The practical
experience with the BVAR model has been very good. It has produced
better forecasts than many structural simultaneous equation models. 4 It has,
however, been criticized as being “a-theoretical econometrics” because it
does not use any economic theory. Sims criticized the traditional simultaneous
equations models on the ground that they relied on ad hoc restrictions on the
parameters to achieve identification. However, the BVAR model brings in some
restrictions through the back door. The question is: What interpretation can be
given to these restrictions? 5 Estimation of the VAR and BVAR models is left
as an exercise using some of the data sets at the end of Chapter 13.

14.4 Unit Roots 


The single topic in the 1980s that attracted the most attention and to which
most econometricians have devoted their energies is that of testing for unit
roots. The number of papers on this topic runs into the hundreds. 6 We have
given a brief introduction to this in Section 6.10, where we discussed the
difference between trend stationary (TS) and difference stationary (DS) time
series and the Dickey-Fuller test. We review the problems in greater detail here.

The issue of whether a time series is TS or DS has both economic and
statistical implications. If a series is DS, the effect of any shock is permanent. For
instance, consider the model

$$ y_t = y_{t-1} + \varepsilon_t $$

where $\varepsilon_t$ is a zero-mean stationary process. Suppose that in some time period,
say $T$, there is a jump $C$ in $\varepsilon_T$. Then $y_T, y_{T+1}, y_{T+2}, \ldots$ all
increase by $C$. Thus the effect of the shock $C$ is permanent. On the other hand, if we
have the model

$$ y_t = \alpha y_{t-1} + \varepsilon_t, \qquad |\alpha| < 1 $$


4 R. B. Litterman, “Forecasting with Bayesian Vector Autoregression: Five Years of
Experience,” Journal of Business and Economic Statistics, Vol. 4, 1986, pp. 25-38.

5 There have been other procedures of imposing constraints on the coefficients in the vector
autoregressions that give them a structural interpretation. An example is O. J. Blanchard, “A
Traditional Interpretation of Macroeconomic Fluctuations,” American Economic Review, Vol.
79, 1989, pp. 1146-1164.

6 See F. X. Diebold and M. Nerlove, “Unit Roots in Economic Time Series: A Selective
Survey,” in T. Fomby and G. Rhodes (eds.), Advances in Econometrics, Vol. 8 (Greenwich, Conn.:
JAI Press, 1990). This “selective” survey lists more than 200 papers in the 1980s.
the effect of the shock fades away over time. Starting with $y_T$, which will jump
by $C$, successive values of $y_t$ will increase by $C\alpha, C\alpha^2, C\alpha^3, \ldots$.
Since monetary shocks probably do not have a permanent effect on GNP, if real GNP is
DS, fluctuations in real GNP have to be explained by real shocks, not monetary
shocks. Thus the issue of whether, in the autoregression $y_t = \alpha y_{t-1} + \varepsilon_t$,
the root $\alpha$ is equal to 1 or less than 1, that is, whether there is a unit root or
not, is very important for macroeconomists.
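The contrast between a permanent and a transitory shock is easy to trace numerically. The sketch below, with a hypothetical function name and illustrative numbers of our own, lists the effect of a one-time shock on the series at each later date.

```python
def shock_path(alpha, shock, horizons):
    """Effect on y at dates T, T+1, ..., T+horizons of a one-time shock at date T."""
    return [shock * alpha ** h for h in range(horizons + 1)]


# Unit root: the shock never dies out.
permanent = shock_path(alpha=1.0, shock=2.0, horizons=3)

# Stationary case: the effect decays geometrically toward zero.
transitory = shock_path(alpha=0.5, shock=2.0, horizons=3)
```

With a shock of 2, the first path stays at 2 in every period, whereas the second falls to 1, 0.5, and 0.25 in the three periods that follow.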

On the statistical side, there are two issues. The first is about the trend
removal methods used (by regression or by differencing). As pointed out by
Nelson and Kang (1981) and as discussed in Section 6.10, spurious autocorrelation
results whenever a DS series is de-trended or a TS series is differenced.

The second statistical problem is that the least squares estimate of the
autoregressive parameter $\alpha$ has a nonstandard distribution (not
the usual normal, t, or F) when there is a unit root. This distribution has to be
computed numerically on a case-by-case basis, depending on what other variables
are included in the regression (constant term, trend, other lags, and so
on). This accounts for the proliferation of the unit root tests and the associated
tables.

Returning to the economic issue, it does not really make sense to hinge an
economic theory on a point estimate of a parameter, on whether or not there
is a unit root (i.e., $\alpha = 1$). A model with $\alpha = 0.95$ is really not statistically
distinguishable from one with $\alpha = 1$ in small samples. 7 The relevant question is
not whether $\alpha = 1$ or not, but how big the autoregressive parameter is or how
long it takes for shocks in GNP to die out. 8 Cochrane argued that GNP does
revert toward a “trend” following a shock, but that this reversion occurs over
a time horizon characteristic of business cycles, several years at least. Yet
another point is that, as we noted earlier, the effect of a shock is permanent if
$\alpha = 1$ and goes to zero progressively if $\alpha < 1$. However, in many economic
problems, what we are concerned with is the present value of future streams
of $y_t$. If the discount factor is $\beta$, the present value of the effect of a shock $C$ is
$C/(1 - \beta\alpha)$. Without discounting ($\beta = 1$) this is finite if $\alpha < 1$ and
infinite with $\alpha = 1$. Thus the unit root makes a difference. But if $\beta < 1$ (i.e.,
with discounting), the effect is finite for all $\alpha < (1/\beta)$. Thus the existence of a
unit root is not important. 9
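The present-value expression follows from summing a geometric series. As a short worked step (ours, using only the quantities defined above), a shock $C$ at date $T$ changes $y_{T+j}$ by $C\alpha^j$, and discounting each effect by $\beta^j$ gives

$$
\sum_{j=0}^{\infty} \beta^j \left(C\alpha^j\right) = C \sum_{j=0}^{\infty} (\beta\alpha)^j = \frac{C}{1 - \beta\alpha}, \qquad |\beta\alpha| < 1
$$

For example, with a unit root ($\alpha = 1$) and an illustrative discount factor $\beta = 0.95$, the present value is $C/0.05 = 20C$, which is large but finite.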


14.5 Unit Root Tests 

Consider first the model 


$$ y_t = \alpha y_{t-1} + \varepsilon_t $$


7 J. H. Cochrane, “A Critique of the Application of the Unit Root Tests,” Journal of Economic
Dynamics and Control, Vol. 15, No. 2, 1991, pp. 275-284.

8 J. H. Cochrane, “How Big Is the Random Walk in GNP?” Journal of Political Economy, Vol.
96, 1988, pp. 893-920.

9 See S. R. Blough, “Unit Roots, Stationarity and Persistence in Finite Sample
Macroeconometrics,” Discussion Paper, Johns Hopkins University, 1990.
where $\varepsilon_t$ is white noise. In the random walk case ($\alpha = 1$) it is well known
that OLS estimation of this equation produces an estimate of $\alpha$ that is biased
toward zero. However, the OLS estimate is also biased toward zero when $\alpha$ is
less than but near 1. Evans and Savin (1981, 1984) 10 provide Monte Carlo
evidence on the bias and other aspects of the distributions.
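A small simulation gives a feel for this bias. The sketch below is our own illustration and does not reproduce the design of Evans and Savin; the sample size, number of replications, seed, and function names are all assumptions chosen for convenience.

```python
import random


def ols_ar1(series):
    """OLS estimate of alpha in y_t = alpha * y_{t-1} + e_t (no constant)."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den


def mean_estimate(alpha, n_obs, n_reps, seed=0):
    """Average OLS estimate over n_reps simulated series of length n_obs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        y = [0.0]
        for _ in range(n_obs):
            y.append(alpha * y[-1] + rng.gauss(0.0, 1.0))
        total += ols_ar1(y)
    return total / n_reps


# Average estimates for a random walk and for a root near, but below, 1.
mean_random_walk = mean_estimate(alpha=1.0, n_obs=50, n_reps=2000)
mean_near_unit = mean_estimate(alpha=0.95, n_obs=50, n_reps=2000)
```

In both cases the average estimate falls below the true value of the parameter, which is the downward bias referred to above.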

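To discuss the Dickey-Fuller tests, consider the model

$$
\begin{aligned}
y_t &= \beta_0 + \beta_1 t + u_t \\
u_t &= \alpha u_{t-1} + \varepsilon_t
\end{aligned}
$$

where $\varepsilon_t$ is a covariance stationary process with zero mean. The reduced form
for this model is

$$ y_t = \gamma + \delta t + \alpha y_{t-1} + \varepsilon_t \tag{14.3} $$

where $\gamma = \beta_0(1 - \alpha) + \beta_1\alpha$ and $\delta = \beta_1(1 - \alpha)$. This
equation is said to have a unit root if $\alpha = 1$ (in which case $\delta = 0$).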
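The reduced form can be verified by direct substitution. As a worked step (ours), write $u_t = y_t - \beta_0 - \beta_1 t$ and substitute into the equation for $u_t$:

$$
\begin{aligned}
y_t - \beta_0 - \beta_1 t &= \alpha\left[y_{t-1} - \beta_0 - \beta_1 (t - 1)\right] + \varepsilon_t \\
y_t &= \left[\beta_0(1 - \alpha) + \beta_1\alpha\right] + \beta_1(1 - \alpha)\,t + \alpha y_{t-1} + \varepsilon_t
\end{aligned}
$$

Setting $\alpha = 1$ gives $\gamma = \beta_1$ and $\delta = 0$, so that $y_t = \beta_1 + y_{t-1} + \varepsilon_t$, a random walk with drift $\beta_1$.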

Dickey-Fuller Test

The Dickey-Fuller tests are based on testing the hypothesis $\alpha = 1$ in (14.3)
under the assumption that $\varepsilon_t$ are white noise errors. There are three test
statistics

$$ K(1) = T(\hat{\alpha} - 1), \qquad t(1) = \frac{\hat{\alpha} - 1}{SE(\hat{\alpha})}, \qquad F(0, 1) $$

where $\hat{\alpha}$ is the OLS estimate of $\alpha$ from (14.3), $SE(\hat{\alpha})$ is the
standard error of $\hat{\alpha}$, $T$ is the sample size, and $F(0, 1)$ is the usual
F-statistic for testing the joint hypothesis $\delta = 0$ and $\alpha = 1$ in (14.3). These
statistics do not have the standard normal, t, and F distributions. The critical values for
$K(1)$ and $t(1)$ are tabulated for $\delta = 0$ in Fuller (1976) and the critical values for
the $F(0, 1)$ statistic are tabulated in Dickey and Fuller (1981). 11

The Serial Correlation Problem 

Dickey and Fuller, Said and Dickey (1984), Phillips (1987), Phillips and Perron
(1988), 12 and others developed modifications of the Dickey-Fuller tests when
$\varepsilon_t$ is not white noise. These tests, called the “augmented” Dickey-Fuller (ADF)
tests, involve estimating the equation

$$ y_t = \gamma + \delta t + \alpha y_{t-1} + \sum_{j=1}^{k} \theta_j \Delta y_{t-j} + \varepsilon_t \tag{14.4} $$

10 G. Evans and N. E. Savin, “Testing for Unit Roots: I,” Econometrica, Vol. 49, 1981, pp. 753-
779, and “Testing for Unit Roots: II,” Econometrica, Vol. 52, 1984, pp. 1241-1269.

11 W. A. Fuller, Introduction to Statistical Time Series (New York: Wiley, 1976), Table 8.5.2.
The reference to Dickey and Fuller and excerpts from their tables are shown in Section 6.10.

12 S. E. Said and D. A. Dickey, “Testing for Unit Roots in ARMA Models of Unknown Order,”
Biometrika, Vol. 71, 1984, pp. 599-607; P. C. B. Phillips, “Time Series Regression with a Unit
Root,” Econometrica, Vol. 55, 1987, pp. 277-302; P. C. B. Phillips and P. Perron, “Testing for
a Unit Root in Time Series Regression,” Biometrika, Vol. 75, 1988, pp. 335-346.
The purpose in adding the terms $\Delta y_{t-j}$ is to allow for ARMA error processes.
But if the MA parameter is large, the AR approximation would be poor unless
$k$ is large. 13

After estimating this augmented equation, the tests $K(1)$, $t(1)$, and $F(0, 1)$
discussed earlier are used. These test statistics have been shown to have
asymptotically the same distribution as the Dickey-Fuller test statistics and
hence the same significance tables can be used.
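The regressions behind the DF and ADF statistics can be sketched as follows. This is a minimal illustration under stated assumptions, not a substitute for a tested statistical package: the helper names, the simulated data, and the choice of two lagged differences are ours, and the statistics must still be compared with the Dickey-Fuller tables rather than with the usual t tables. Note also that the sketch computes $K(1)$ directly from the estimated coefficient; refinements sometimes applied to this statistic in the augmented case are not attempted here.

```python
import math
import random


def ols(rows, y):
    """OLS via the normal equations; returns (coefficients, standard errors)."""
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination on [X'X | I | X'y] with partial pivoting.
    aug = [xtx[i] + [float(i == j) for j in range(k)] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    coefs = [aug[i][2 * k] for i in range(k)]
    inverse_diag = [aug[i][k + i] for i in range(k)]  # diagonal of (X'X)^(-1)
    resid = [v - sum(c * x for c, x in zip(coefs, r)) for r, v in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    return coefs, [math.sqrt(s2 * d) for d in inverse_diag]


def adf_statistics(series, n_lags=0):
    """Return (alpha_hat, K(1), t(1)) from regression (14.4).

    With n_lags = 0 the regression reduces to (14.3), the plain DF case.
    """
    rows, y = [], []
    for t in range(n_lags + 1, len(series)):
        lagged_diffs = [series[t - j] - series[t - j - 1]
                        for j in range(1, n_lags + 1)]
        rows.append([1.0, float(t), series[t - 1]] + lagged_diffs)
        y.append(series[t])
    coefs, ses = ols(rows, y)
    alpha_hat, se_alpha = coefs[2], ses[2]
    n_used = len(y)
    return alpha_hat, n_used * (alpha_hat - 1.0), (alpha_hat - 1.0) / se_alpha


# Simulated stationary AR(1) data with alpha = 0.5, purely for illustration.
rng = random.Random(0)
series = [0.0]
for _ in range(200):
    series.append(0.5 * series[-1] + rng.gauss(0.0, 1.0))

alpha_df, k_df, t_df = adf_statistics(series)                # DF, equation (14.3)
alpha_adf, k_adf, t_adf = adf_statistics(series, n_lags=2)   # ADF with k = 2
```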

The Dickey-Fuller (DF) and the augmented Dickey-Fuller (ADF) tests were
very popular methods of testing for unit roots. Recently two other test
statistics, the $Z_\alpha$ and the $Z_t$ test statistics, suggested by Phillips, are also
being commonly used. These statistics are based on a nonparametric modification of the
Dickey-Fuller tests and, hence, are beyond the scope of our discussion.

Powers of Unit Root Tests 

For many economic time series the DF, ADF, $Z_\alpha$, and $Z_t$ tests have consistently
shown that the null hypothesis of a unit root (that is, $\alpha = 1$) cannot be rejected.
This has led to the conclusion that almost all economic time series are DS type.
This result has been attributed to the fact that the tests have very low power
against relevant TS (trend stationary) alternatives. They do not reject the
hypothesis $\alpha = 1$, but they do not reject the hypothesis $\alpha = 0.95$ either.

Another problem that has been pointed out by Choi 14 is that if the errors have
a moving-average component, the long autoregressions used in (14.4) to
account for serial correlation bias the OLS estimate $\hat{\alpha}$ of $\alpha$ toward 1,
thus suggesting the presence of a unit root, even when $|\alpha| < 1$.

Nelson and Plosser (1982) start with equation (14.3) and test $\alpha = 1$ under the
restriction $\delta = 0$. For the errors $\varepsilon_t$ they assume that

$$ \varepsilon_t = e_t + \phi e_{t-1}, \qquad |\phi| < 1, \qquad e_t \sim \text{IID}(0, \sigma^2) \tag{14.5} $$

after observing the sample autocorrelations of the first differences of U.S.
historical data. However, instead of estimating this model, they used (14.4) to
account for the moving-average errors. These long autoregressions produce an
upward-biased estimate of $\alpha$. Choi performed a Monte Carlo study assuming
that $\gamma = 5$, $\delta = 1$, $\alpha = 0.5$ in (14.4), and $\phi = 0.3$ in (14.5) and
found that $\hat{\alpha}$ from (14.4) is biased toward 1.

What Are the Null and Alternative Hypotheses in Unit Root Tests?

The null hypothesis in unit root tests is $H_0$: $\alpha = 1$. In the theory of hypothesis
testing, the null hypothesis and the alternative are not on the same footing.

13 G. W. Schwert, “Tests for Unit Roots: A Monte Carlo Investigation,” Journal of Business
and Economic Statistics, Vol. 7, 1989, pp. 147-159. His conclusion is that the best test is the
ADF with a long k. See, however, the paper by Choi cited in the next footnote, who argues
against the long autoregressions.

14 I. Choi, “Most U.S. Economic Time Series Do Not Have Unit Roots: Nelson and Plosser’s
(1982) Results Reconsidered,” Manuscript, Ohio State University, 1990.
The null hypothesis is on a pedestal and it is rejected only when there is
overwhelming evidence against it. That is why one uses a 5% or 1% significance
level. 15 If, on the other hand, the null hypothesis and alternative were to be

$H_0$: $y_t$ is stationary and $H_1$: $y_t$ is nonstationary

the conclusions would be quite different. In the Bayesian approach, $H_0$ and $H_1$
are on the same footing and hence this asymmetry does not arise. 16 Tests for
unit roots with the null hypothesis being stationarity (no unit root) have also
been developed and they often give results contrary to those of the unit root
tests with the unit root as the null. 17

In the Box-Jenkins approach, whether or not there is a unit root is decided 
by visual inspection of the correlogram. If the correlogram tails off, the time 
series is considered to be stationary. Otherwise, we examine the correlogram 
of the first differences. The unit root tests are just a formalization of this visual 
inspection. However, if the unit root tests and the Box-Jenkins approach give 
conflicting results, it is important to examine why this is so. In practice, it 
sometimes happens that a unit root test does not reject the null hypothesis of a
unit root (at the traditional significance levels), although the correlogram tapers
off. Hence it is important to examine the correlogram before applying any unit 
root tests. 18 
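The visual check described above can be mimicked numerically by computing the sample autocorrelations of a series and of its first differences. The following is a minimal sketch on simulated data; the helper `sample_acf`, the seed, and the series length are illustrative choices made here, not part of the text.

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocorrelations r_1, ..., r_max_lag of a series."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = d @ d
    return np.array([d[k:] @ d[:len(d) - k] / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(1)
walk = np.cumsum(rng.standard_normal(500))    # simulated random walk (unit root)

acf_levels = sample_acf(walk, 10)             # dies out very slowly
acf_diffs = sample_acf(np.diff(walk), 10)     # close to zero at every lag

print(np.round(acf_levels, 2))
print(np.round(acf_diffs, 2))
```

For the random walk the correlogram of the levels stays close to 1 for many lags, whereas the correlogram of the first differences is close to zero throughout, which is the pattern that leads one to difference the series in the Box-Jenkins approach.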

In tests for unit roots it is important to examine the specification of the
alternatives. When we say a unit root test has low power, we have to specify
against what alternative it has low power. For example, the Phillips test is based on
the OLS estimator $\hat{\alpha}$ of the parameter $\alpha$ in the model $y_t = \alpha y_{t-1} + u_t$, whereas
the Phillips-Perron test is based on the OLS estimator $\hat{\alpha}$ of $\alpha$ in the model
$y_t = \mu + \alpha y_{t-1} + u_t$. In both cases, the null hypothesis $H_0$ is the same, namely
$\Delta y_t = u_t$. The alternative $H_1$ is $y_t = u_t$ in the Phillips test and $y_t = \mu + u_t$ in
the Phillips-Perron test.

15 It is of historical significance that the widely used 5% and 1% levels were originally suggested
in a paper by R. A. Fisher in 1926 in the Journal of the Ministry of Agriculture of Great Britain.
Fisher was a conservative and did not want to change a treatment unless there was overwhelming
evidence that the new treatment was better. Fisher is the father of modern statistics and his
prescriptions have been followed without any question ever since.

16 D. N. DeJong and C. H. Whiteman, “Reconsidering Trends and Random Walks in Macroeconomic
Time Series,” Journal of Monetary Economics, Vol. 28, No. 2, Oct. 1991, used a Bayesian
method and found only two of the Nelson-Plosser series to be of the DS type.

17 For tests with no unit root as the null, see J. H. Kahn and M. Ogaki, “A Chi-square Test for
a Unit Root,” Economics Letters, Vol. 34, 1990, pp. 37-42; J. Y. Park and B. Choi, “A New
Approach to Testing for a Unit Root,” CAE Working Paper 88-23, Cornell University, 1988; H.
Bierens, “Testing Stationarity Against the Unit Root Hypothesis,” Manuscript, Free University
of Amsterdam, 1990; and D. Kwiatkowski, P. C. B. Phillips, and P. Schmidt, “Testing the
Null Hypothesis of Stationarity Against the Alternative of a Unit Root: How Sure Are We That
Economic Time Series Have a Unit Root?” Econometrics Paper, Michigan State University,
November 1990.

18 Bierens, “Testing Stationarity,” gives an example of the monthly time series on the U.S. inflation
rate, 1961-1987 (204 observations), where the correlogram tapers off but the Phillips and
Phillips-Perron tests do not reject the null of a unit root. The test by Bierens for which stationarity
is the null does not reject the null of stationarity.

Very often the alternative to the simple random walk model (a DS model) is
a trend-stationary model. In such cases it can be analytically demonstrated that
the test has low power. 19 For instance, in the Said and Dickey (1985) paper the
alternative to a simple random walk model is a stationary process with a nonzero
mean. That is,

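$$H_0: \alpha = 1 \quad \text{in} \quad y_t = \alpha y_{t-1} + u_t$$

$$H_1: |\alpha| < 1 \quad \text{in} \quad y_t - \delta = \alpha(y_{t-1} - \delta) + u_t$$

In this case, not being able to reject $H_0$ does not mean that there is strong
evidence against $H_1$. The test for a unit root is based on the OLS estimator $\hat{\alpha}$
from the equation $y_t = \alpha y_{t-1} + u_t$. Thus $\hat{\alpha} = \sum y_t y_{t-1} / \sum y_{t-1}^2$. But under the
alternative we have, assuming that $\operatorname{var}(u_t) = \sigma^2$ and that the $u_t$ are serially
uncorrelated,

$$y_t = \delta + (1 - \alpha L)^{-1} u_t$$

In large samples, $\sum y_{t-1}^2 / T \to \delta^2 + \sigma^2/(1 - \alpha^2)$ and $\sum y_t y_{t-1} / T \to \delta^2 + \alpha\sigma^2/(1 - \alpha^2)$.
Hence, under $H_1$,

$$\operatorname{plim} \hat{\alpha} = \frac{\delta^2(1 - \alpha^2) + \alpha\sigma^2}{\delta^2(1 - \alpha^2) + \sigma^2} = 1 - \frac{\sigma^2(1 - \alpha)}{\delta^2(1 - \alpha^2) + \sigma^2}$$

If $\delta$ is very large, $\hat{\alpha} \to 1$ under the alternative $H_1$. Hence the test based on $\hat{\alpha}$
has low power against $H_1$.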
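The large-sample result can be checked by simulation. The sketch below generates a stationary AR(1) series around a nonzero mean and compares the no-intercept OLS estimate with the probability limit $1 - \sigma^2(1 - \alpha)/[\delta^2(1 - \alpha^2) + \sigma^2]$ implied by the two limits given above. The parameter values ($\alpha = 0.5$, $\delta = 10$, $\sigma = 1$) and the sample size are arbitrary illustrative choices.

```python
import numpy as np

alpha, delta, sigma, T = 0.5, 10.0, 1.0, 200_000
rng = np.random.default_rng(42)

# Stationary AR(1) deviations around the mean delta: y_t - delta = alpha*(y_{t-1} - delta) + u_t
u = sigma * rng.standard_normal(T)
dev = np.zeros(T)
for t in range(1, T):
    dev[t] = alpha * dev[t - 1] + u[t]
y = delta + dev

# OLS estimator from the regression of y_t on y_{t-1} with no intercept
alpha_hat = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])

# Probability limit under the stationary alternative
plim_alpha_hat = 1 - sigma**2 * (1 - alpha) / (delta**2 * (1 - alpha**2) + sigma**2)

print(f"alpha_hat = {alpha_hat:.4f}, plim = {plim_alpha_hat:.4f}, true alpha = {alpha}")
```

Although the true $\alpha$ is 0.5, the estimate from the regression without an intercept is close to 1 because the omitted mean $\delta$ dominates both sums.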

The same would be the case where the null hypothesis is a unit root with
drift and the alternative is a stationary series around a linear time trend, as
considered by Dickey, Bell, and Miller. 20 We have

$$H_0: \alpha = 1 \quad \text{in} \quad y_t = \mu + \alpha y_{t-1} + u_t \quad (\mu \neq 0)$$

$$H_1: |\alpha| < 1 \quad \text{in} \quad y_t - \beta_1 - \beta_2 t = \mu + \alpha[y_{t-1} - \beta_1 - \beta_2 (t - 1)] + u_t$$

Again we can show that the OLS estimator $\hat{\alpha}$ from $y_t = \mu + \alpha y_{t-1} + u_t$ will
tend to 1 under $H_1$ and the test based on $\hat{\alpha}$ will have low power under $H_1$. In
both these cases the model size under the alternative is larger than under the
null, that is, it has more parameters under the alternative than under the null.
For this reason Choi suggests the following tests:

Test 1

$$H_0: \alpha = 1 \quad \text{in} \quad y_t = \alpha y_{t-1} + u_t$$

$$H_1: |\alpha| < 1$$

Test 2

$$H_0: \alpha = 1,\ \mu = 0 \quad \text{in} \quad y_t = \mu + \alpha y_{t-1} + u_t$$

$$H_1: |\alpha| < 1$$

19 Choi, “Most U.S. Economic Time Series.” The discussion here is based on that paper.

20 D. A. Dickey, W. R. Bell, and R. B. Miller, “Unit Roots in Time Series Models: Tests and 
Applications,” American Statistician, Vol. 40, 1986, pp. 12-26. 
Test 3 (the Nelson-Plosser test)

$$H_0: \alpha = 1,\ \beta = 0 \quad \text{in} \quad y_t = \mu + \beta t + \alpha y_{t-1} + u_t$$

$$H_1: |\alpha| < 1$$

In each case the dimension of the model is the same under both $H_0$ and $H_1$.

Structural Change and Unit Roots 

In all the studies on unit roots, the issue of whether a time series is of the DS
or TS type was decided by analyzing the series for the entire time period during
which many major events took place. The Nelson-Plosser series, for instance,
covered the period 1909-1970, which includes the two world wars and the
Depression of the 1930s. If there have been any changes in the trend because
of these events, the results obtained by assuming a constant parameter structure
during the entire period will be suspect. Many studies done using the traditional
multiple regression methods have included dummy variables (see Sections
8.2 and 8.3) to allow for different intercepts (and slopes). Rappoport and
Reichlin (1989) show that a segmented trend model is a feasible alternative to
the DS model. 21

Perron (1989) argues that standard tests of the unit root hypothesis against
trend-stationary (TS) alternatives cannot reject the unit root hypothesis if the
time series has a structural break. 22 Of course, one can also construct examples
where, for instance,

$y_1, y_2, \ldots, y_m$ is a random walk with drift

$y_{m+1}, \ldots, y_{m+n}$ is another random walk with a different drift

and the combined series is not of the DS type. Perron’s study was criticized on
the argument that he “peeked at the data” before analysis: that after looking
at the graph, he decided that there was a break. But Kim (1990), using Bayesian
methods, finds that even allowing for an unknown breakpoint, the standard
tests of the unit root hypothesis were biased in favor of accepting the unit root
hypothesis if the series had a structural break at some intermediate date. 23

21 P. Rappoport and L. Reichlin, “Segmented Trends and Non-stationary Time Series,” Economic
Journal, Vol. 99, conference 1989, pp. 168-177.

22 P. Perron, “The Great Crash, the Oil Price Shock and the Unit Root Hypothesis,”
Econometrica, Vol. 57, 1989, pp. 1361-1401.

23 In-Moo Kim, “Structural Change and Unit Roots,” unpublished Ph.D. dissertation, University
of Florida, 1990. Kim studies the Nelson-Plosser and Friedman-Schwartz real per capita
GNP series for the U.S. (annual) and quarterly OECD data. Another study that argues that many
of the U.S. economic time series that were considered to be of the DS type are not necessarily so and
that the evidence on the DS versus TS issue is mixed is G. D. Rudebusch, “Trends and Random
Walks in Macroeconomic Time Series: A Re-examination,” Federal Reserve Paper 1139,
Washington, D.C., December 1990.

When using long time series, as many of these studies have done, it is important
to take account of structural changes. Parameter constancy tests have
frequently been used in traditional regression analysis. But somehow, all the
traditional diagnostic tests are ignored when it comes to the issue of the DS
versus TS analysis. The main issue with economic time series is how best to
build dynamic economic models. The testing for unit roots has received a lot
more attention than the estimation aspect.

14.6 Cointegration 


An important issue in econometrics is the need to integrate short-run dynamics
with long-run equilibrium. The traditional approach to the modeling of short-run
disequilibrium is the partial adjustment model discussed in Section 10.6.
An extension of this is the ECM (error correction model), which also incorporates
the past period’s disequilibrium [see equation (10.17)]. The analysis of
short-run dynamics is often done by first eliminating trends in the variables,
usually by differencing. This procedure, however, throws away potentially valuable
information about long-run relationships about which economic theories
have a lot to say. The theory of cointegration developed in Granger (1981) and
elaborated in Engle and Granger (1987) 24 addresses this issue of integrating
short-run dynamics with long-run equilibrium. We discussed this briefly in Section
6.10. We now go through it in greater detail.

We start with a few definitions. A time series $y_t$ is said to be integrated of
order 1, or I(1), if $\Delta y_t$ is a stationary time series. A stationary time series is said
to be I(0). A random walk is a special case of an I(1) series, because if $y_t$ is a
random walk, $\Delta y_t$ is a random series or white noise. White noise is a special
case of a stationary series. A time series $y_t$ is said to be integrated of order 2,
or I(2), if $\Delta y_t$ is I(1), and so on. If $y_t \sim I(1)$ and $u_t \sim I(0)$, then their sum $z_t =
y_t + u_t \sim I(1)$.
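These definitions can be made concrete with simulated data. In the sketch below (the seed and series length are arbitrary choices made here), cumulating white noise once gives an I(1) series and cumulating twice gives an I(2) series; differencing the same number of times recovers the white noise.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(1000)          # white noise, I(0)

y = np.cumsum(e)                       # random walk, I(1): its first difference is e
w = np.cumsum(y)                       # I(2): its first difference is the random walk y
z = y + rng.standard_normal(1000)      # I(1) plus I(0) is still I(1)

first_diff_y = np.diff(y)              # equals e[1:]
second_diff_w = np.diff(w, n=2)        # equals e[2:]

print(np.allclose(first_diff_y, e[1:]), np.allclose(second_diff_w, e[2:]))
print("variance of z:", round(float(z.var()), 1), " variance of diff(z):", round(float(np.diff(z).var()), 1))
```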

Suppose that $y_t \sim I(1)$ and $x_t \sim I(1)$. Then $y_t$ and $x_t$ are said to be cointegrated
if there exists a $\beta$ such that $y_t - \beta x_t$ is I(0). This is denoted by saying $y_t$ and $x_t$
are CI(1, 1). 25 What this means is that the regression equation

$$y_t = \beta x_t + u_t$$

makes sense because $y_t$ and $x_t$ do not drift too far apart from each other over
time. Thus there is a long-run equilibrium relationship between them. If $y_t$ and
$x_t$ are not cointegrated, that is, $y_t - \beta x_t = u_t$ is also I(1), they can drift apart
from each other more and more as time goes on. Thus there is no long-run
equilibrium relationship between them. In this case the relationship between $y_t$
and $x_t$ that we obtain by regressing $y_t$ on $x_t$ is “spurious” (see Section 6.3).
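The contrast between a cointegrated pair and a spurious regression can be illustrated with simulated series. In this sketch the true coefficient of 2, the sample size, and the helper names are assumptions chosen for illustration. The residuals from the cointegrated pair behave like a stationary series, while the residuals from the regression between two independent random walks still behave like a unit root process.

```python
import numpy as np

def ols_slope(y, x):
    """Slope from the regression of y on x without an intercept."""
    return (x @ y) / (x @ x)

def ar1_coef(u):
    """First-order autoregressive coefficient of a residual series."""
    return (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])

rng = np.random.default_rng(7)
T = 2000
x = np.cumsum(rng.standard_normal(T))          # I(1) regressor
y_coint = 2.0 * x + rng.standard_normal(T)     # cointegrated with x (beta = 2)
y_indep = np.cumsum(rng.standard_normal(T))    # independent random walk

b_coint = ols_slope(y_coint, x)
b_indep = ols_slope(y_indep, x)
rho_coint = ar1_coef(y_coint - b_coint * x)    # near 0: residuals are I(0)
rho_indep = ar1_coef(y_indep - b_indep * x)    # near 1: residuals are I(1)

print(f"cointegrated: slope = {b_coint:.3f}, residual AR(1) = {rho_coint:.3f}")
print(f"independent:  slope = {b_indep:.3f}, residual AR(1) = {rho_indep:.3f}")
```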


24 C. W. J. Granger, “Some Properties of Time Series Data and Their Use in Econometric Model
Specification,” Journal of Econometrics, Vol. 16, No. 1, 1981, pp. 121-130; R. F. Engle and
C. W. J. Granger, “Cointegration and Error Correction: Representation, Estimation and Testing,”
Econometrica, Vol. 55, No. 2, 1987, pp. 251-276.

25 More generally, if $y_t \sim I(d)$ and $x_t \sim I(d)$, then $y_t$ and $x_t$ are $CI(d, b)$ if $y_t - \beta x_t \sim I(d - b)$ with
$b > 0$.
In the Box-Jenkins method, if the time series is nonstationary (as evidenced
by the correlogram not damping), we difference the series to achieve stationarity
and then use elaborate ARMA models to fit the stationary series. When
we are considering two time series, $y_t$ and $x_t$ say, we do the same thing. This
differencing operation eliminates the trend or long-term movement in the series.
However, what we may be interested in is explaining the relationship between
the trends in $y_t$ and $x_t$. We can do this by running a regression of $y_t$ on
$x_t$, but this regression will not make sense if a long-run relationship does not
exist. By asking the question of whether $y_t$ and $x_t$ are cointegrated, we are
asking whether there is any long-run relationship between the trends in $y_t$
and $x_t$.

The case with seasonal adjustment is similar. Instead of eliminating the seasonal
components from y and x and then analyzing the deseasonalized data, we
might also be asking whether there is a relationship between the seasonals in y
and x. This is the idea behind “seasonal cointegration.” 26 Note that in this case
we do not consider first differences or I(1) processes. For instance, with
monthly data we consider 12th differences $y_t - y_{t-12}$. Similarly, for $x_t$ we consider
$x_t - x_{t-12}$.

When we talk of common trends, we have to distinguish between what are
commonly called deterministic and stochastic trends. In Section 6.10 we talked
about these as the TSP (trend stationary process) and DSP (difference stationary
process). Detrending (by running a regression on time) assumes the presence
of a deterministic trend, and differencing assumes the presence of a stochastic
trend. The concept of cointegration refers to the idea of common
stochastic trends. But this is not the only kind of common trends. One can also
have common deterministic trends. 27 The same concept extends to seasonals
as well. Seasonal adjustment using dummy variables assumes a deterministic
seasonal, and seasonal adjustment by differencing as discussed in the Box-Jenkins
approach in Section 13.6 assumes a stochastic seasonal. The concept
of seasonal cointegration applies to stochastic seasonals. In practice, both deterministic
and stochastic components could be present in a time series, so that
we can write the time series as

$$x_t = T_t + S_t + \mu_t + \eta_t + \varepsilon_t$$

where $T_t$ represents deterministic trends (e.g., a polynomial in $t$), $S_t$ represents
a deterministic seasonal (e.g., seasonal dummy variables), $\mu_t$ represents a stochastic
trend [e.g., an I(1) process], and $\eta_t$ represents a stochastic seasonal
[e.g., with quarterly data, $(1 - L^4)\eta_t$ is stationary].

Ignoring the presence of deterministic components leads to some misleading 
inferences on cointegration. But to simplify our analysis and to concentrate on 
the issues of cointegration, we shall assume that there are no deterministic 
elements in the time series we consider. 

26 S. Hylleberg, R. F. Engle, C. W. J. Granger, and S. Yoo, “Seasonal Integration and Co-integration,”
Working Paper, University of California at San Diego, 1988.

27 See H. Kang, “Common Deterministic Trends, Common Factors, and Co-integration,” in
Fomby and Rhodes (eds.), Advances in Econometrics, pp. 249-269.
14.7 The Cointegrating Regression


To fix ideas, we shall consider the simple example considered by Engle and
Granger (1987, p. 263). Consider two possibly correlated white noise errors, $e_{1t}$
and $e_{2t}$. Let $x_{1t}$ and $x_{2t}$ be two series generated by the following model:

$$x_{1t} + \beta x_{2t} = u_{1t}, \qquad u_{1t} = u_{1,t-1} + e_{1t} \tag{14.6}$$

$$x_{1t} + \alpha x_{2t} = u_{2t}, \qquad u_{2t} = \rho u_{2,t-1} + e_{2t}, \qquad |\rho| < 1 \tag{14.7}$$

Note that $u_{1t} \sim I(1)$ and $u_{2t} \sim I(0)$. The model is internally consistent only if
$\alpha \neq \beta$. The reason for this constraint is that if $\alpha = \beta$, it is impossible to find
any values for $x_{1t}$ and $x_{2t}$ that simultaneously satisfy both equalities. The parameters
$\alpha$ and $\beta$ are unidentified in the usual sense because there are no exogenous
variables, and the errors are correlated. The reduced forms for $x_{1t}$ and $x_{2t}$ are

$$x_{1t} = \frac{\alpha}{\alpha - \beta}\,u_{1t} - \frac{\beta}{\alpha - \beta}\,u_{2t}$$

$$x_{2t} = -\frac{1}{\alpha - \beta}\,u_{1t} + \frac{1}{\alpha - \beta}\,u_{2t}$$

They are linear combinations of $u_{1t}$ and $u_{2t}$, and hence, they are both I(1). Note
that equation (14.7) describes a linear combination of two I(1) variables that is
stationary. Thus, $x_{1t}$ and $x_{2t}$ are cointegrated.

In this case a linear least squares regression of $x_{1t}$ on $x_{2t}$ produces a consistent
estimate of $\alpha$ that is actually “super-consistent,” that is, it tends to the true
value faster than the usual OLS estimator. In the usual case, if $\hat{\beta}$ is the least
squares estimator of $\beta$, the estimation error $\hat{\beta} - \beta$ is of order $1/\sqrt{T}$, whereas in
the case here the estimation error is of order $1/T$ as $T \to \infty$. This regression of
$x_{1t}$ on $x_{2t}$ is called the “cointegrating regression.”
All linear combinations of $x_{1t}$ and $x_{2t}$ other than the cointegrating combination
(14.7) will have an infinite variance. There is no simultaneous equations
bias in the estimation by OLS of equation (14.7) because the correlation between
$x_{2t}$ and $u_{2t}$ is of a lower order in $T$ than the variance of $x_{2t}$, which tends to
infinity as $T \to \infty$. This is the case whether we regress $x_1$ on $x_2$ to get an estimate
of $(-\alpha)$ or we regress $x_2$ on $x_1$ to get an estimate of $(-1/\alpha)$. Note that if $\rho = 1$,
$u_{2t}$ is also I(1) and then we do not have a cointegrating regression.
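The behavior of the cointegrating regression can be examined by simulating the model (14.6)-(14.7) from its reduced form. The sketch below is illustrative only: the values $\alpha = 2$, $\beta = 1$, $\rho = 0.5$, the independent standard normal errors, and the function name `simulate_eg` are assumptions made here. Since $x_{1t} = -\alpha x_{2t} + u_{2t}$, the slope from regressing $x_{1t}$ on $x_{2t}$ estimates $-\alpha$.

```python
import numpy as np

def simulate_eg(T, alpha=2.0, beta=1.0, rho=0.5, seed=0):
    """Simulate x1, x2 from the two-equation model via its reduced form."""
    rng = np.random.default_rng(seed)
    e1 = rng.standard_normal(T)
    e2 = rng.standard_normal(T)
    u1 = np.cumsum(e1)                     # random walk error
    u2 = np.zeros(T)                       # stationary AR(1) error
    for t in range(1, T):
        u2[t] = rho * u2[t - 1] + e2[t]
    x1 = (alpha * u1 - beta * u2) / (alpha - beta)
    x2 = (-u1 + u2) / (alpha - beta)
    return x1, x2

errors = {}
for T in (100, 400, 1600, 6400):
    x1, x2 = simulate_eg(T)
    slope = (x2 @ x1) / (x2 @ x2)          # OLS slope, estimates -alpha
    errors[T] = slope + 2.0                # estimation error relative to -alpha = -2
    print(f"T = {T:5d}  slope = {slope:.4f}  error = {errors[T]:+.4f}")
```

Despite the correlation between $x_{2t}$ and $u_{2t}$, the estimation error becomes small quickly as $T$ grows, in line with the absence of simultaneity bias in large samples noted above.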

Note that equations (14.6) and (14.7) can be written in the autoregressive
form

$$\begin{aligned} \Delta x_{1t} &= \beta\delta\, x_{1,t-1} + \alpha\beta\delta\, x_{2,t-1} + \eta_{1t} \\ \Delta x_{2t} &= -\delta\, x_{1,t-1} - \alpha\delta\, x_{2,t-1} + \eta_{2t} \end{aligned} \tag{14.8}$$

where $\delta = (1 - \rho)/(\alpha - \beta)$ and $\eta_{1t}$ and $\eta_{2t}$ are linear combinations of the $e$'s.
If we define $z_t = x_{1t} + \alpha x_{2t}$, we can write these as

$$\begin{aligned} \Delta x_{1t} &= \beta\delta\, z_{t-1} + \eta_{1t} \\ \Delta x_{2t} &= -\delta\, z_{t-1} + \eta_{2t} \end{aligned} \tag{14.9}$$

Equations (14.8) give the VAR (vector autoregression) representation for this
simple model.

An error correction model (ECM) is of the form

$$\Delta y_t = \gamma_1 \Delta x_t + \gamma_2 (y - \beta x)_{t-1} + u_t$$

It relates the change in y to the change in x and the past period’s disequilibrium.
The ECM in this form for the model we have been considering can be derived
simply by noting that we have defined $z_t = x_{1t} + \alpha x_{2t}$. Hence, by equation
(14.7), we have

$$\begin{aligned} z_t &= \rho z_{t-1} + e_{2t} \quad \text{or} \\ \Delta z_t &= (\rho - 1) z_{t-1} + e_{2t} \quad \text{or} \\ \Delta x_{1t} &= -\alpha \Delta x_{2t} + (\rho - 1) z_{t-1} + e_{2t} \end{aligned} \tag{14.10}$$

However, note that when estimated by OLS, this equation gives inconsistent
estimates of the parameters because $x_{2t}$, and hence $\Delta x_{2t}$, is correlated with $e_{2t}$. Note
also that all the variables in this equation are I(0).

Equations (14.9) can also be regarded as ECM representations except that in
this model $\Delta x_{1t}$ does not involve $\Delta x_{2t}$, and vice versa. When estimated by OLS,
equations (14.9) give consistent estimates of the parameters because $\eta_{1t}$ and $\eta_{2t}$
are serially uncorrelated. We can get a consistent estimate of $\beta$ from the estimation
of equations (14.9).

One question we might ask at this stage is: How have we managed to identify
the parameters $\alpha$ and $\beta$ in equations (14.6) and (14.7)? The answer is that we
have done this by exploiting the information in the specification of the error
terms: $u_{1t}$ is a random walk and $u_{2t}$ is I(0). Although by considering a linear
combination of the two equations we can generate an equation that looks like
each, no linear combination can generate an I(0) error in equation (14.7). Hence
$\alpha$ is identified. Similarly, no linear combination can generate a random walk
error as in (14.6). Thus $\beta$ is identified. Equation (14.7) can be estimated by OLS
to get a consistent estimate of $\alpha$. This is free of the simultaneity bias because
of the nature of $x_{2t}$, which is I(1), and $u_{2t}$, which is I(0). We then construct $z_t$
and get an estimate of $\beta$ from equations (14.9).
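The two-step identification argument can be sketched on simulated data. Everything specific below is an assumption for illustration: $\alpha = 2$, $\beta = 1$, $\rho = 0.5$ (so that $\delta = (1-\rho)/(\alpha-\beta) = 0.5$), independent standard normal errors, and the sample size. Step 1 estimates $\alpha$ from the static regression; step 2 builds $z_t$ and regresses each first difference on $z_{t-1}$, so that the two slopes estimate $\beta\delta$ and $-\delta$, and their ratio gives $\beta$.

```python
import numpy as np

alpha, beta, rho, T = 2.0, 1.0, 0.5, 20_000
rng = np.random.default_rng(3)
e1, e2 = rng.standard_normal(T), rng.standard_normal(T)

u1 = np.cumsum(e1)                         # random walk error of the first equation
u2 = np.zeros(T)                           # stationary AR(1) error of the second equation
for t in range(1, T):
    u2[t] = rho * u2[t - 1] + e2[t]
x1 = (alpha * u1 - beta * u2) / (alpha - beta)
x2 = (-u1 + u2) / (alpha - beta)

# Step 1: cointegrating regression of x1 on x2; the slope estimates -alpha
alpha_hat = -(x2 @ x1) / (x2 @ x2)

# Step 2: construct z_t and regress the first differences on z_{t-1}
z = x1 + alpha_hat * x2
z_lag = z[:-1]
c1 = (z_lag @ np.diff(x1)) / (z_lag @ z_lag)   # estimates beta * delta
c2 = (z_lag @ np.diff(x2)) / (z_lag @ z_lag)   # estimates -delta

delta_hat = -c2
beta_hat = -c1 / c2

print(f"alpha_hat = {alpha_hat:.3f}, delta_hat = {delta_hat:.3f}, beta_hat = {beta_hat:.3f}")
```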

Engle and Granger suggest estimating the cointegrating regression first (note
that this is a static regression, that is, a regression with no dynamics or lags)
and then estimating the short-run dynamics through variants of the ECM by a
two-stage estimation method using the estimated coefficient from the cointegrating
regression. As discussed in Section 6.10, others have suggested estimating
the long-run parameters and short-run dynamics simultaneously.
Banerjee et al. 28 perform a Monte Carlo study based on a model similar to that
given by equations (14.6) and (14.7) and find that in small samples, the estimates
of $\alpha$ from the static regression (14.7) are biased. They suggest that it is better
to estimate the long-run parameter through a dynamic model.

28 A. Banerjee, J. Dolado, D. F. Hendry, and G. Smith, “Exploring Equilibrium Relationships
in Econometrics Through Static Models: Some Monte Carlo Evidence,” Oxford Bulletin of
Economics and Statistics, Vol. 48, 1986, pp. 253-277.

What has been said about the regressions in the I(1) variables can also be
said about seasonal data. Our discussion of regressions involving variables with
stochastic trends suggests that if y is differenced and x is not, so that y is I(0)
and x is I(1), a regression of y on x does not make sense. If both y and x have
trends, so that $y \sim I(1)$ and $x \sim I(1)$, a regression of y on x does not make sense
unless they are cointegrated, that is, there exists a $\beta$ such that $y - \beta x \sim I(0)$.
This is a case of common stochastic trends. Similar is the case with seasonal
data. If y is seasonally adjusted and x is not, a regression of y on x does not
make sense. If both y and x have stochastic seasonal elements, a regression of
y on x makes sense only if they are seasonally cointegrated, that is, there are
common seasonal elements. Also note that if $y_t$ and $x_t$ are both I(1), a regression
of the form $\Delta y_t = \beta \Delta x_t + \gamma x_{t-2} + u_t$ does not make sense because $\Delta y_t$, $\Delta x_t$,
and $u_t$ are all I(0) but $x_{t-2}$ is not. It is I(1). Thus not all the variables are on the
same level.


14.8 Vector Autoregressions 
and Cointegration 

There is a simple relationship between vector autoregressions and cointegration.
In the two-variable case we have considered, if the characteristic roots of
the matrix of coefficients 29 in the VAR model are both equal to unity, the series
are both I(1) but not cointegrated; if precisely one of the roots is unity, the
series are cointegrated. If neither of the roots is unity, the series are stationary,
so they are neither integrated nor cointegrated. In the example we have considered
the VAR model given by equation (14.8) can be written as

$$\begin{aligned} x_{1t} &= (1 + \beta\delta)\, x_{1,t-1} + \alpha\beta\delta\, x_{2,t-1} + \eta_{1t} \\ x_{2t} &= -\delta\, x_{1,t-1} + (1 - \alpha\delta)\, x_{2,t-1} + \eta_{2t} \end{aligned}$$

The matrix of coefficients is

$$A = \begin{bmatrix} 1 + \beta\delta & \alpha\beta\delta \\ -\delta & 1 - \alpha\delta \end{bmatrix}$$


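The characteristic roots are 1 and $1 - \alpha\delta + \beta\delta$. Since $\delta(\alpha - \beta) = 1 - \rho$, the
second root equals $\rho$, which is less than 1 in absolute value. Thus the series are
cointegrated. Note that if $\rho = 1$, then $\delta = 0$ and we have two unit roots. In this case
$x_1$ and $x_2$ are not cointegrated. If we consider the matrix of coefficients in
equations (14.8), we have to talk of zero roots rather than unit roots since the
matrix of coefficients is $A - I$. Note that $A - I$ is a singular matrix. It can be
written as

$$A - I = \begin{bmatrix} \beta\delta & \alpha\beta\delta \\ -\delta & -\alpha\delta \end{bmatrix} = \begin{bmatrix} \beta\delta \\ -\delta \end{bmatrix} \begin{bmatrix} 1 & \alpha \end{bmatrix} \tag{14.11}$$

29 See the Appendix to Chapter 7 for a discussion of characteristic roots.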

How do we find the cointegrating relationship from the VAR model? The
procedure is as follows: Find the characteristic roots. Then corresponding to
each root, find the characteristic vector. This is obtained by solving the equations
$(A - \lambda I)C = 0$. For instance, corresponding to the root $\lambda = 1$, we have

$$\begin{aligned} \beta\delta C_1 + \alpha\beta\delta C_2 &= 0 \\ -\delta C_1 - \alpha\delta C_2 &= 0 \end{aligned}$$

This gives $C_1 = -\alpha$, $C_2 = 1$. Similarly, for the other root $(1 - \alpha\delta + \beta\delta)$ we
get the characteristic vector as $C_1 = -\beta$, $C_2 = 1$. Consider the matrix with
these vectors as columns. This is

$$R = \begin{bmatrix} -\alpha & -\beta \\ 1 & 1 \end{bmatrix}$$

Now invert this matrix. We have

$$R^{-1} = \frac{1}{\beta - \alpha} \begin{bmatrix} 1 & \beta \\ -1 & -\alpha \end{bmatrix}$$

Then the rows in this matrix give the required linear combinations:

$x_1 + \beta x_2$ is nonstationary (corresponding to the unit root).

$-x_1 - \alpha x_2$ is stationary.
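The same steps can be carried out numerically. The sketch below uses illustrative parameter values ($\alpha = 2$, $\beta = 1$, $\rho = 0.5$, chosen here and not taken from the text), builds the coefficient matrix $A$, and reads the two linear combinations off the rows of $R^{-1}$. Each row is rescaled so that the coefficient on $x_1$ is 1, which changes neither stationarity nor nonstationarity.

```python
import numpy as np

alpha, beta, rho = 2.0, 1.0, 0.5
delta = (1 - rho) / (alpha - beta)

# Coefficient matrix of the VAR in levels
A = np.array([[1 + beta * delta, alpha * beta * delta],
              [-delta,           1 - alpha * delta]])

roots, R = np.linalg.eig(A)       # columns of R are the characteristic vectors
R_inv = np.linalg.inv(R)          # rows of R_inv give the linear combinations

unit = int(np.argmin(np.abs(roots - 1.0)))   # position of the unit root
other = 1 - unit                             # position of the stable root

trend_combo = R_inv[unit] / R_inv[unit, 0]    # proportional to (1, beta): nonstationary
coint_combo = R_inv[other] / R_inv[other, 0]  # proportional to (1, alpha): stationary

print("roots:", np.round(roots, 4))
print("unit root combination:     x1 + %.2f x2" % trend_combo[1])
print("cointegrating combination: x1 + %.2f x2" % coint_combo[1])
```

With these values the stable root equals $\rho = 0.5$, the unit-root combination is $x_1 + x_2$, and the cointegrating combination is $x_1 + 2x_2$.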

In this example, we started out with a VAR model with one unit root. In
practice, we have to test for the unit roots. We do this as follows. Let the root
closest to 1 be denoted by $\gamma$. Then we consider $n(\gamma - 1)$ and refer to the tables
for $n(\hat{\rho} - 1)$ or $n(\hat{\rho}_\mu - 1)$ in Fuller (1976, p. 371), depending on whether $\mu$ is
known or estimated; $n$ is the sample size. As an example, consider the VAR
model with $y_t$ = income and $C_t$ = consumption, based on 53 observations
(1898-1950), which produced the following results: 30

$$\begin{bmatrix} \Delta y_t \\ \Delta C_t \end{bmatrix} = \begin{bmatrix} 34.16 \\ 31.50 \end{bmatrix} + \begin{bmatrix} 0.055 & -0.110 \\ 0.291 & -0.371 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ C_{t-1} \end{bmatrix}$$

The matrix $A - I$ has characteristic roots -0.0424 and -0.2740. The roots of
$A$ are obtained by adding 1 to each. They are 0.9576 and 0.7260. To test whether
the root 0.9576 is significantly different from 1, consider 53(0.9576 - 1) =
-2.25. This is not less than the tabulated 5% value in Fuller (1976, p. 371),
which is -13.29. Thus this root is not significantly different from 1. We next
compute the matrix $R$ of characteristic vectors and $R^{-1}$. (These computations
are left as an exercise.) The two rows of $R^{-1}$ give the result that $(C_t - 2.99y_t)$
is (approximately) a unit root process and $(C_t - 0.88y_t)$ is (approximately)
stationary. The latter result says that 0.88 is the long-run marginal propensity
to consume.

30 This example is from D. A. Dickey, “Testing for Unit Roots in Vector Processes and Its Relation
to Cointegration,” in Fomby and Rhodes (eds.), Advances in Econometrics, pp. 87-105.
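The computations left as an exercise can be sketched as follows, taking the estimated matrix $A - I$ from the income-consumption example as given. The variable names are illustrative; the rows of $R^{-1}$ are rescaled so that the coefficient on consumption is 1.

```python
import numpy as np

n = 53                                    # sample size in the example
M = np.array([[0.055, -0.110],            # estimated A - I; first row: income equation
              [0.291, -0.371]])           # second row: consumption equation

roots, R = np.linalg.eig(M)               # characteristic roots and vectors of A - I
order = np.argsort(-roots)                # root closest to zero (unit root of A) first
roots = roots[order]
R_inv = np.linalg.inv(R[:, order])

test_stat = n * roots[0]                  # n(gamma - 1), since gamma - 1 is a root of A - I
combos = R_inv / R_inv[:, [1]]            # each row scaled so that C_t has coefficient 1

print("roots of A - I:", np.round(roots, 4))
print("roots of A:    ", np.round(1 + roots, 4))
print("test statistic: %.2f" % test_stat)
print("unit root combination: C - %.2f y" % -combos[0, 0])
print("stationary combination: C - %.2f y" % -combos[1, 0])
```

Up to rounding, this reproduces the roots, the test statistic, and the two linear combinations reported in the example.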

In the two-variable case, the cointegration coefficient, if it exists, is uniquely
determined. Also, in this case, the matrix $(A - I)$ for the VAR model has rank
1. As we saw in equation (14.11), it can be expressed as $CB'$, where $C$ and $B$
are column vectors and $B'$ gives the cointegrating vector. In the case of more than
two variables, there can be more than one cointegrating regression and these
need not be uniquely determined. For instance, suppose that there are $n$ variables.
Suppose there are $(n - r)$ unit roots and $r$ cointegrating vectors. In this
case the matrix $A - I$ will be of rank $r < n$. As before, we can then write

$$A - I = CB'$$

where $C$ and $B$ are $n \times r$ matrices. The rows of $B'$ are the $r$ distinct cointegrating
vectors. However, they may not all have a meaningful economic interpretation
and we have to choose the linear combinations that make economic
sense. In the case where there are $(n - 1)$ unit roots, so that $r = 1$, the cointegrating
vector, if it exists, will be unique. The determination of the cointegrating
vectors (and their number) for a general VAR model with $n$ variables
and $k$ lags is described by Johansen. 31 This is too complicated for our purposes.
Instead, we consider a procedure that can be applied for a VAR model with
one lag if there is only one unit root, that is, all the variables are driven
by one stochastic trend. Thus if we have three variables, $n = 3$, $(n - r) = 1$,
and so $r = 2$. Thus there are two cointegrating vectors. The procedure of deriving
them starting with a VAR model is as follows:

1. Write down the matrix of coefficients in the VAR model.

2. Find the characteristic roots.

3. Find the characteristic vectors (see the Appendix to Chapter 7). These
are the vectors $x$ that solve the equation $(A - \lambda I)x = 0$.

4. Let $R$ be the matrix whose columns are these vectors. Find $R^{-1}$. Then
the rows of $R^{-1}$ corresponding to the nonunit roots give the coefficients
of the cointegrating regressions. The row corresponding to the unit root
gives the coefficients of the unit root process.

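An example will illustrate this procedure. 32 Consider the VAR model

$$\begin{bmatrix} x_t \\ y_t \\ z_t \end{bmatrix} = \begin{bmatrix} 0.8 & -0.38 & -0.02 \\ -0.2 & 0.56 & 0.04 \\ -0.28 & -0.28 & 0.72 \end{bmatrix} \begin{bmatrix} x_{t-1} \\ y_{t-1} \\ z_{t-1} \end{bmatrix} + \begin{bmatrix} e_{1t} \\ e_{2t} \\ e_{3t} \end{bmatrix}$$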


31 S. Johansen, “Statistical Analysis of Cointegration Vectors,” Journal of Economic Dynamics
and Control, Vol. 12, 1988, pp. 231-254; S. G. Hall, “Maximum Likelihood Estimation of
Cointegration Vectors: An Example of the Johansen Procedure,” Oxford Bulletin of Economics
and Statistics, Vol. 51, No. 2, 1989, pp. 213-218; and D. A. Dickey, D. W. Jansen, and D. L.
Thornton, “A Primer on Cointegration with an Application to Money and Income,” Federal
Reserve Bank of St. Louis Review, March-April 1991, pp. 58-78.

32 This example is adapted from Dickey, “Unit Roots.”
The characteristic roots of the matrix $A$ of the coefficients in the VAR model
are 1.0, 0.66, and 0.42. The matrix $R$ whose columns are the characteristic
vectors corresponding to these roots, and its inverse, are

$$R = \begin{bmatrix} -0.8165 & 0.1879 & 0.4483 \\ 0.4082 & 0.0167 & 0.4082 \\ 0.4082 & 0.9820 & 0.7952 \end{bmatrix}$$

and

$$R^{-1} = \begin{bmatrix} -0.8363 & 0.6275 & 0.1473 \\ -0.3408 & -1.7957 & 1.1141 \\ 0.8503 & 1.8956 & -0.1950 \end{bmatrix}$$

Note that $R^{-1}AR$ is a diagonal matrix with diagonal elements 1, 0.66, and
0.42. Hence

$$\begin{aligned} f_1 &= -0.8363x + 0.6275y + 0.1473z \\ f_2 &= -0.3408x - 1.7957y + 1.1141z \\ f_3 &= 0.8503x + 1.8956y - 0.1950z \end{aligned}$$

are the three linear functions we need. $f_1$ is a unit root process; $f_2$ and $f_3$ are
stationary processes. They are cointegrating regressions.

However, note that linear combinations of $f_2$ and $f_3$ are also stationary. For
instance, take $f_2$ and $f_3$ and eliminate $y$. We get $(x + 2z)$ as a cointegration
equation. Similarly, eliminating $x$, we get $y - z$ as a cointegration equation,
and eliminating $z$, we get $x + 2y$ as a cointegration equation. Note that in this
model the matrix $(A - I)$ has one zero root and hence rank 2. It can be written
as

$$A - I = CB' = \begin{bmatrix} -0.2 & -0.38 \\ -0.2 & -0.44 \\ -0.28 & -0.28 \end{bmatrix} \begin{bmatrix} 1 & 0 & 2 \\ 0 & 1 & -1 \end{bmatrix}$$

The rows of $B'$ give the cointegrating vectors, which are $x + 2z$ and $y - z$ that
we derived earlier. Which of these different cointegrating equations we choose
depends on which of them has a meaningful economic interpretation. There is
one other thing we can do in this example. We have identified $f_1$ as the common
stochastic trend. Hence if we regress $x$, $y$, and $z$ in turn on $f_1$, the residuals will
be stationary.
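The four-step procedure can be checked numerically for the three-variable example. The sketch below takes the coefficient matrix of the example as given; the variable names are illustrative. Because characteristic vectors are determined only up to scale, the rows of $R^{-1}$ produced by the software may differ from $f_1$, $f_2$, $f_3$ by scale factors, but the combinations obtained after eliminating one variable are the same.

```python
import numpy as np

A = np.array([[ 0.80, -0.38, -0.02],
              [-0.20,  0.56,  0.04],
              [-0.28, -0.28,  0.72]])

# Steps 2 and 3: characteristic roots and vectors, ordered with the unit root first
roots, R = np.linalg.eig(A)
order = np.argsort(-roots.real)
roots = roots[order].real
R = R[:, order].real

# Step 4: rows of R^{-1}; row 0 goes with the unit root, rows 1 and 2 are stationary
R_inv = np.linalg.inv(R)
f2, f3 = R_inv[1], R_inv[2]

# Eliminate one variable at a time from the two stationary combinations
no_y = f2 * f3[1] - f3 * f2[1]; no_y = no_y / no_y[0]   # proportional to x + 2z
no_x = f2 * f3[0] - f3 * f2[0]; no_x = no_x / no_x[1]   # proportional to y - z
no_z = f2 * f3[2] - f3 * f2[2]; no_z = no_z / no_z[0]   # proportional to x + 2y

# Rank-2 factorization A - I = C B'
C = (A - np.eye(3))[:, :2]
B_t = np.array([[1.0, 0.0,  2.0],
                [0.0, 1.0, -1.0]])

print("roots:", np.round(roots, 4))
print("eliminating y:", np.round(no_y, 4))
print("eliminating x:", np.round(no_x, 4))
print("eliminating z:", np.round(no_z, 4))
print("A - I equals C B':", np.allclose(C @ B_t, A - np.eye(3)))
```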

To simplify the exposition, we have discussed the VAR model with only one
lag. Suppose instead that we have a general VAR model with $k$ lags:

$$x_t = A_1 x_{t-1} + A_2 x_{t-2} + \cdots + A_k x_{t-k} + e_t \tag{14.12}$$

This can be written as

$$\Delta x_t = B_1 \Delta x_{t-1} + B_2 \Delta x_{t-2} + \cdots + B_{k-1} \Delta x_{t-k+1} + B_k x_{t-k} + e_t \tag{14.13}$$

where $B_i = -I + A_1 + A_2 + \cdots + A_i$, $i = 1, 2, \ldots, k$. If $x_t$ is I(1), then $\Delta x_t$
is I(0). If some linear combinations of $x_t$ are stationary, that is, there are some
cointegrating relationships among the variables in $x_t$, then the matrix $B_k$ should
not be of full rank, where $B_k = -I + A_1 + A_2 + \cdots + A_k$. Note that in the
earlier discussion we had $k = 1$, and hence we considered $(-I + A_1)$.
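That (14.13) is only a reparameterization of (14.12) can be verified numerically. The sketch below uses arbitrary randomly drawn coefficient matrices and lagged values for $k = 3$ lags and $n = 3$ variables (all illustrative assumptions), forms $B_i = -I + A_1 + \cdots + A_i$, and checks that both equations imply the same $\Delta x_t$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 3, 3

A = [rng.normal(scale=0.3, size=(n, n)) for _ in range(k)]    # A_1, ..., A_k
B = [-np.eye(n) + sum(A[:i + 1]) for i in range(k)]           # B_i = -I + A_1 + ... + A_i

x_lags = [rng.standard_normal(n) for _ in range(k)]           # x_{t-1}, ..., x_{t-k}
e = rng.standard_normal(n)

# Levels form: x_t = A_1 x_{t-1} + ... + A_k x_{t-k} + e_t
x_t = sum(A[i] @ x_lags[i] for i in range(k)) + e

# Differenced form: lagged differences plus the level term B_k x_{t-k}
dx_lags = [x_lags[i] - x_lags[i + 1] for i in range(k - 1)]   # Delta x_{t-1}, ..., Delta x_{t-k+1}
dx_t = sum(B[i] @ dx_lags[i] for i in range(k - 1)) + B[k - 1] @ x_lags[k - 1] + e

print(np.allclose(dx_t, x_t - x_lags[0]))
```

The level term carries the matrix $B_k$, whose rank determines the number of cointegrating relationships.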

The preceding discussion also suggests that if some of the variables in a VAR
model are cointegrated, this implies some restrictions on the parameters of the
VAR model. In Section 14.3 we pointed out that predictions from the unrestricted
VAR models were not good, and hence some restrictions on the parameters
were imposed in the Bayesian VAR (BVAR) approach. The cointegration
theory gives a theoretical basis for imposing some restrictions on the VAR
model. It has been found that predictions from the VAR model improved with
restrictions imposed by cointegration theory. 33 However, many of the comparisons
made have been with unrestricted VAR rather than the BVAR. What we
need to do is compare the predictions from VAR models that use cointegration
restrictions with those generated by BVAR. Note that as we described in Section
14.3, the Bayesian VAR approach, by assuming a prior coefficient of unity
on the own lagged terms, implicitly assumes that all the variables in the VAR
model are unit root processes.

Suppose we consider a set of 3 variables, all of which are unit root processes. 
Suppose also that there are 2 cointegrating relationships among them. Then 
since any linear combination of cointegrated relationships is also cointegrated, 
it becomes very difficult to give any economic interpretation to the cointegrated 
relations. Each of them is a long-run equilibrium relationship, and all linear 
combinations are equilibrium relationships. There is, thus, an identification 
problem, and unless we have some extraneous information, we cannot identify 
the long-run equilibrium relationship. This has been the experience with the 
estimation of some long-run demand for money functions. For instance, Johansen
and Juselius 34 estimate the demand for money functions for Denmark and
Finland using quarterly data. For the Danish data the sample was 1974-1 to 
1987-3 (55 observations). For the Finnish data the sample was 1958-1 to 
1984-3 (67 observations). For the Danish data, there was only one cointegrated 
relationship, and this simplified the interpretation of the cointegrating vector as 
a long-run demand for money function. But for the Finnish data there were 3 
cointegrating vectors, and this caused problems of interpretation. 

This is not surprising because “cointegration” is a purely statistical concept 
based on properties of the time-series considered. It is “A-theoretical Econometrics.”
Cointegrated relationships need not have any economic meaning. But
even if they do not, they can be used to improve predictions from the VAR 
models. 

One important contribution of cointegration tests is to the modelling of VAR
systems, whether they should be in levels or first differences or both with some
restrictions. For this purpose the cointegration relationships need not have any
economic interpretation. The cointegrated relations are of value only in determining
the restrictions of the VAR system. (It is all part of a-theoretical econometrics
anyway.)

33 R. F. Engle and B. S. Yoo, “Forecasting and Testing in Cointegrated Systems,” Journal of
Econometrics, Vol. 35, 1987, pp. 143-159.

34 S. Johansen and K. Juselius, “Maximum Likelihood Estimation and Inference on Cointegration
with Applications to the Demand for Money,” Oxford Bulletin of Economics and Statistics,
Vol. 52, 1990, pp. 169-210.

If a set of unit root variables satisfies a cointegration relation, simple first
differencing of all the variables can lead to econometric problems. 35 In the general
VAR system with n variables, if all the variables are stationary, using
an unrestricted VAR in levels is appropriate. If the variables are all I(1) but
no cointegrating relation exists, then application of an unrestricted VAR in
first differences is appropriate. If there are r cointegrating relationships, then
we need to model the system as a VAR in the r stationary combinations and
(n - r) differences of the original variables.

In any case, when our interest is in forecasting, the existence of some cointegrating
relationships, even if they do not have any economic interpretation,
should help us to improve the forecasts from the VAR models. The multiplicity 
of cointegrating vectors (and the resulting identification problems mentioned 
earlier) need not worry us. Cointegration relationships that do not make any 
economic sense need not be discarded. In fact, this is the most important use 
of cointegration tests and cointegrating relationships. 

14.9 Cointegration and Error Correction Models

If $x_t$ and $y_t$ are cointegrated, there is a long-run relationship between them.
Furthermore, the short-run dynamics can be described by the error correction
model (ECM). This is known as the Granger representation theorem.

If $x_t \sim I(1)$, $y_t \sim I(1)$, and $z_t = y_t - \beta x_t$ is I(0), then x and y are said to be
cointegrated. The Granger representation theorem says that in this case $x_t$ and
$y_t$ may be considered to be generated by ECMs of the form

$$\Delta x_t = \rho_1 z_{t-1} + \text{lagged}(\Delta x_t, \Delta y_t) + e_{1t}$$

$$\Delta y_t = \rho_2 z_{t-1} + \text{lagged}(\Delta x_t, \Delta y_t) + e_{2t}$$

where at least one of $\rho_1$ and $\rho_2$ is nonzero and $e_{1t}$ and $e_{2t}$ are white-noise errors.
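A small worked case may make the ECM form concrete. The setup is an assumption made here for illustration and is not taken from the text: $x_t$ is a pure random walk and the equilibrium error follows a stationary AR(1) with a hypothetical coefficient $\phi$.

$$x_t = x_{t-1} + e_{1t}, \qquad z_t = y_t - \beta x_t = \phi z_{t-1} + e_{2t}, \qquad |\phi| < 1$$

Differencing gives $\Delta x_t = e_{1t}$, and since $y_t = \beta x_t + z_t$,

$$\Delta y_t = \beta \Delta x_t + \Delta z_t = -(1 - \phi) z_{t-1} + (\beta e_{1t} + e_{2t})$$

In this special case the coefficient on $z_{t-1}$ is zero in the equation for $\Delta x_t$ and equals $-(1 - \phi) \neq 0$ in the equation for $\Delta y_t$, there are no lagged differences, and the error $\beta e_{1t} + e_{2t}$ is white noise. At least one adjustment coefficient is nonzero, as the theorem requires.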

Granger and Lee 36 suggest a further generalization of the concept of cointegration.
Define $w_t = \sum_{j=0}^{t} z_{t-j}$, that is, $w_t$ is an accumulated sum of $z_t$, or
$\Delta w_t = z_t$. Since $z_t \sim I(0)$, $w_t$ will be I(1). Then $x_t$ and $y_t$ are said to be
multicointegrated if $x_t$ and $w_t$ are cointegrated. In this case $y_t$ and $w_t$ will also be
cointegrated. It follows that $u_t = w_t - \alpha x_t \sim I(0)$, where $\alpha$ is the cointegration
constant. If $x_t$ and $y_t$ are multicointegrated, Granger and Lee show that they
have the following (generalized) ECM representation:

$$\Delta x_t = \rho_1 z_{t-1} + \delta_1 w_{t-1} + \text{lagged}(\Delta x_t, \Delta y_t) + e_{1t}$$

$$\Delta y_t = \rho_2 z_{t-1} + \delta_2 w_{t-1} + \text{lagged}(\Delta x_t, \Delta y_t) + e_{2t}$$

An example is: $x_t$ = sales, $y_t$ = production, $z_t = y_t - x_t$ = inventory
change, and $w_t$ = inventory. Sales, production, and inventory are all I(1) and
possibly cointegrated; $z_t$, the inventory change, is I(0).

35 See P. C. B. Phillips, “Optimal Inference in Cointegrated Systems,” Econometrica, Vol. 59,
1991, pp. 283-306.

36 C. W. J. Granger and Tae-Hwy Lee, “Multicointegration,” in Fomby and Rhoades (eds.),
Advances in Econometrics, pp. 71-84.


14.10 Tests for Cointegration 


An important ingredient in the analysis of cointegrated systems is tests for cointegration.
Consider, first, the case of two variables, x and y. We first apply unit
root tests to check that x and y are both I(1). We next regress y on x (or x on
y) and consider $u = y - \beta x$. We then apply unit root tests on u.

If x and y are cointegrated, $u = y - \beta x$ is I(0). On the other hand, if they
are not cointegrated, u will be I(1). Since unit root tests will be applied to u,
the null hypothesis (as we discussed in Section 14.5) is that there is a unit root.
Thus the null hypothesis and the alternative in cointegration tests are:

$H_0$: u has a unit root, or x and y are not cointegrated
$H_1$: x and y are cointegrated

The additional problem here is that u is not observed. Hence we use the estimated
residual $\hat{u}$ from the cointegrating regression. Engle and Granger (1987)
suggest several cointegration tests but recommend using the ADF test to test
for unit roots in $\hat{u}_t$.
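The two-step procedure just described can be sketched in a few lines. This is a minimal illustration on simulated data, not the authors' implementation: the data-generating values (intercept 2.0, slope 0.5, seed 1) and the function names are assumptions, and the simple DF regression (no lagged differences) is used in place of the ADF version. The critical value is the DF entry of Table 14.2 for n = 2 and T = 100.

```python
import random

def ols_with_constant(y, x):
    """OLS of y on a constant and x; returns (intercept, slope, residuals)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [b - intercept - slope * a for a, b in zip(x, y)]
    return intercept, slope, resid

def df_tstat(u):
    """t-ratio of alpha in the regression  du_t = alpha * u_{t-1} + eta_t."""
    lag = u[:-1]
    du = [b - a for a, b in zip(u[:-1], u[1:])]
    sll = sum(a * a for a in lag)
    alpha = sum(a * d for a, d in zip(lag, du)) / sll
    ssr = sum((d - alpha * a) ** 2 for a, d in zip(lag, du))
    s2 = ssr / (len(du) - 1)
    return alpha / (s2 / sll) ** 0.5

# Simulated cointegrated pair (hypothetical values): x is a random walk,
# y = 2 + 0.5 x + white noise, so u = y - 0.5 x is I(0).
rng = random.Random(1)
T = 100
x, level = [], 0.0
for _ in range(T):
    level += rng.gauss(0, 1)
    x.append(level)
y = [2.0 + 0.5 * xt + rng.gauss(0, 1) for xt in x]

# Step 1: cointegrating regression. Step 2: unit root test on the residuals.
_, beta_hat, u_hat = ols_with_constant(y, x)
t_stat = df_tstat(u_hat)

CRITICAL_5PCT = -3.37  # Table 14.2, DF column, n = 2, T = 100
reject_no_cointegration = t_stat < CRITICAL_5PCT
print("estimated slope:", beta_hat)
print("DF t-ratio on residuals:", t_stat)
print("reject H0 of no cointegration:", reject_no_cointegration)
```

Because the residuals are estimated rather than observed, the statistic is compared with the cointegration critical values of Table 14.2 and not with the ordinary unit root values of Table 14.1.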

An alternative procedure is to use the VAR model, compute the characteristic
roots of the matrix A of the coefficients of the VAR model (or the roots of
the matrix $B_k$ in equation (14.13) in the case of a general VAR model) and apply
the tests described earlier; that is, consider $n(\lambda - 1)$ and use the tables in
Fuller (1976, p. 371).

The case with more than two variables is more complicated. If there is only 
one unit root, the procedures we described earlier based on the VAR model 
can be used. If there is more than one unit root, the Johansen procedure (referred
to earlier) has to be used.

The meaning of the null and alternative hypotheses in cointegration tests is of
particular importance because the significance levels used are the conventional
1% and 5% levels. In the unit root tests, the null hypothesis is that there is
a unit root. That is, we maintain that the time series are difference stationary.
This null hypothesis is maintained unless there is overwhelming evidence to
reject it. In the case of cointegration tests, the null hypothesis is that there is
no cointegration. That is, we maintain that there are no long-run relationships.
This null hypothesis is maintained unless there is overwhelming evidence to
reject it. The way the null hypothesis and the alternatives are formulated and
the significance levels commonly used for these tests suggest that the dice are
loaded in favor of unit roots and no cointegration.

We can reverse the null and alternative hypotheses for the cointegration test 
if we reverse the null and alternative hypotheses for the unit root tests. That 
is, we adopt the null hypothesis and alternative in unit root tests as: 

$H_0$: $x_t$ is stationary

$H_1$: $x_t$ is a unit root process

with similar hypotheses for $y_t$. Then for the cointegration test we have

$H_0$: $x_t$ and $y_t$ are cointegrated
$H_1$: $x_t$ and $y_t$ are not cointegrated

We mentioned in Section 14.5 some unit root tests that use the null hypothesis 
of stationarity. 

14.11 Cointegration and Testing of the REH and MEH


During recent years, cointegration theory has been used for testing the rational
expectations hypothesis (REH) and the market efficiency hypothesis (MEH).
In Section 10.11 we described some tests for the rationality of $y_t^e$, where $y_t^e$ is
the expectation of $y_t$ (obtained from survey data or other sources). The tests
described there are, however, not valid if $y_t$ and/or $y_t^e$ are I(1). For the validity
of REH, $y_t - y_t^e$ has to be I(0) or else $y_t$ and $y_t^e$ will be drifting further apart
over time, in which case the rationality of $y_t^e$ is violated. However, it is not
sufficient that the forecast error $y_t - y_t^e$ be I(0). The forecast error has to be
free of serial correlation, that is, be white noise. Hence for the rationality of expectations
we require the following three conditions:

1. $y_t$ and $y_t^e$ must be cointegrated.

2. The cointegrating factor must be 1.

3. The difference $(y_t - y_t^e)$ must be a white noise process.

Since the cointegrating factor is specified to be 1, we use what is known as
the restricted cointegration test. That is, we first test, using unit root tests,
whether $y_t$ and $y_t^e$ are both I(1). We next consider $p_t = y_t - y_t^e$ and apply unit
root tests to $p_t$. If the null hypothesis of a unit root (a null hypothesis of no
cointegration) is rejected, $y_t$ and $y_t^e$ are cointegrated with a cointegrating factor
of 1. We next test by using the Q-statistics described in Section 13.5 for the
presence of serial correlation in $p_t$. 37
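The restricted test can be sketched as follows. Everything specific here is an assumption for illustration: the data are simulated (the expectation is taken to be the previous value of a random walk, so the forecast error is white noise by construction), the unit root regression includes a constant, and the Q-statistic is computed in its Box-Pierce form, which may or may not be the exact variant of Section 13.5. Since the cointegrating factor is fixed at 1 and not estimated, the t-ratio is compared with the ordinary unit root value from Table 14.1 (AR(1) with constant, T = 100).

```python
import random

def df_tstat_with_constant(p):
    """t-ratio of alpha in  dp_t = c + alpha * p_{t-1} + eta_t."""
    lag = p[:-1]
    dp = [b - a for a, b in zip(p[:-1], p[1:])]
    n = len(dp)
    ml, md = sum(lag) / n, sum(dp) / n
    sll = sum((a - ml) ** 2 for a in lag)
    alpha = sum((a - ml) * (d - md) for a, d in zip(lag, dp)) / sll
    c = md - alpha * ml
    ssr = sum((d - c - alpha * a) ** 2 for a, d in zip(lag, dp))
    return alpha / (ssr / (n - 2) / sll) ** 0.5

def box_pierce_q(p, max_lag):
    """Box-Pierce Q = T * (sum of squared autocorrelations up to max_lag)."""
    T = len(p)
    mean = sum(p) / T
    d = [v - mean for v in p]
    c0 = sum(v * v for v in d)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(d[t] * d[t - k] for t in range(k, T)) / c0
        q += r_k * r_k
    return T * q

# Simulated example: y is a random walk and the expectation of y_t is y_{t-1}.
rng = random.Random(4)
y, level = [], 0.0
for _ in range(101):
    level += rng.gauss(0, 1)
    y.append(level)
y_actual, y_expected = y[1:], y[:-1]

# Forecast error with the cointegrating factor restricted to 1.
p = [a - f for a, f in zip(y_actual, y_expected)]

t_stat = df_tstat_with_constant(p)
CRITICAL_5PCT = -2.89          # Table 14.1, AR(1) with constant, t-test, T = 100
unit_root_rejected = t_stat < CRITICAL_5PCT

q_stat = box_pierce_q(p, max_lag=10)
CHI2_10_5PCT = 18.31           # 5% point of chi-square with 10 d.f.
no_serial_correlation_found = q_stat < CHI2_10_5PCT

print("unit root t-ratio on p_t:", t_stat, "rejected:", unit_root_rejected)
print("Q(10):", q_stat, "no serial correlation found:", no_serial_correlation_found)
```

Conditions 1 and 2 correspond to rejecting the unit root in $p_t$; condition 3 corresponds to a Q-statistic that does not exceed the chi-square critical value.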

37 Examples of such restricted cointegration tests of the REH appear in R. W. Hafer and S. E.
Hein, “Comparing Futures and Survey Forecasts of Near Term Treasury Bill Rates,” Federal
Reserve Bank of St. Louis Review, May-June 1989, pp. 33-42, and Peter C. Liu and G. S.
Maddala, “Using Survey Data to Test Market Efficiency in the Foreign Exchange Markets,”
Empirical Economics (1992), forthcoming.

As with the regression tests of the REH, the regression tests of the MEH are
also not valid if the variables under consideration have unit roots. In this case,
the cointegration tests should be applied. The exact form of the MEH differs,
depending on the markets being considered. For instance, in the case of the
gold and silver markets, it should not be possible to forecast one price from the
other. Thus gold and silver prices should not be cointegrated. In the case of
foreign exchange rates, if the currency markets are efficient, spot exchange
rates should embody all relevant information, and it should not be possible to
forecast one spot exchange rate as a function of another. That is, the spot exchange
rates across currencies should not be cointegrated. The same should be
the case with forward rates. On the other hand, the future spot rate and the
forward rate should be cointegrated because the forward rate is a predictor of
the future spot rate. 38 The test of the last hypothesis in the regression context
starts by estimating the equation

$$S_{t+1} = \alpha + \beta F_t + e_t$$

The MEH states that $\alpha = 0$ and $\beta = 1$. However, if both $S_{t+1}$ and $F_t$ are I(1)
series, the error $e_t$ is not a stationary white noise process unless the two variables
are cointegrated with $\alpha = 0$ and $\beta = 1$. Some economists try to avoid
the nonstationarity problem by considering the following equation:

$$S_{t+1} - S_t = \alpha + \beta(F_t - S_t) + e_t$$

Then they test whether or not $\alpha = 0$ and $\beta = 1$. However, the model is useful
only if $(F_t - S_t)$ is stationary. For simplicity, assume that both variables are
pure random walk processes. Then the left-hand side of the equation is stationary
because $(S_{t+1} - S_t)$ is a white noise process. But there is no guarantee that
the variable on the right-hand side, $F_t - S_t$, is stationary. Note that $(F_t - S_t)$
can be decomposed into $(F_t - F_{t-1}) + (F_{t-1} - S_t)$. While the first component
is stationary, the second will have the same property only if the MEH holds. 39
One can think of regressing the first difference of $S_{t+1}$ on the first difference of
$F_t$. But this model is misspecified and gives inconsistent estimates of the parameters
because the correct model to be used, if the two variables are cointegrated,
is the error correction model,

$$S_{t+1} - S_t = \alpha + \beta(F_t - F_{t-1}) + \gamma(F_{t-1} - S_t) + e_t$$

The regression in first differences omits the last term, thus causing an omitted
variable bias. Thus if the exchange rate is nonstationary, most of the models
on testing the MEH are inappropriate. If $S_{t+1}$ and $F_t$ are both I(1), the
proper procedure is to use the restricted cointegration test using $S_{t+1} - F_t$.


38 Craig Hakkio and Mark Rush, “Market Efficiency and Cointegration: An Application to the
Sterling and Deutsche Mark Exchange Markets,” Journal of International Money and Finance,
March 1989, pp. 75-88.

39 Thus, if $F_t - S_t$ is stationary, there is no point in further testing the MEH. If $F_t - S_t$ is
nonstationary, we have a regression of a stationary variable on a nonstationary variable. Hence
plim $\hat{\beta} = 0$ and the MEH is rejected.

14.12 A Summary Assessment of Cointegration

Cointegration tests have been used for a wide variety of problems, such as 
testing the permanent income hypothesis, testing rationality of expectations, 
testing market efficiency in different markets, and testing purchasing power 
parity. There are, however, many problems with the use of these tests and their 
interpretation. In a way, in the case of both unit roots and cointegration, there 
is too much emphasis on testing and too little on estimation. 

We discussed earlier the way the null and alternative hypotheses for the unit
root tests and cointegration tests are formulated, and we have also discussed
the arbitrariness in the universal use of the 5% and 1% significance levels. Conclusions
may be reversed if the null and alternative are reversed. For instance,
when the test is conducted with the null hypothesis as a unit root, the null
hypothesis is not usually rejected if the conventional significance levels are
used, and if the null hypothesis is that the time series is stationary (with the
alternative that it has a unit root), again the null hypothesis is not rejected when
the conventional significance levels are used. The same is the case with cointegration
tests when the null hypothesis is of no cointegration, or of cointegration.
There is also the problem of the power of these tests, as discussed in
Section 14.4.

Another important issue is that of bivariate versus multivariate cointegrating
regressions. The issue is similar to simple versus multiple regression. For instance,
$y$ and $x_1$ may not be cointegrated, but $y$, $x_1$, and $x_2$ may be cointegrated.
If $y$, $x_1$, and $x_2$ are all I(1) and there exists a linear combination of these that is
I(0), so that $y = \beta_1 x_1 + \beta_2 x_2 + e$, where $e$ is I(0), then $(y, x_1, x_2)$ are cointegrated.
But when we consider

$$y = \beta_1 x_1 + v$$

since $v = \beta_2 x_2 + e$ is I(1), we will find $y$ and $x_1$ not to be cointegrated. This is
the usual omitted-variable problem. In this case it is wrong to make inferences
just because the hypothesis of no cointegration has not been rejected. For instance,
if y and x refer to prices in two related markets, it is tempting, if the
hypothesis of no cointegration is not rejected, to conclude immediately that the
two markets are efficient. This is indeed incorrect.
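The omitted-variable point can be illustrated by simulation. The sketch below is an assumption-laden example, not a result from the text: $x_1$ and $x_2$ are independent random walks, $y = x_1 + x_2 + e$ with a stationary $e$, and the coefficient values, noise scale, seed, and function names are all hypothetical. It compares the DF t-ratio on residuals from the full cointegrating regression with the one from the regression that omits $x_2$.

```python
import random

def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def resid_one(y, x1):
    """Residuals from OLS of y on a constant and x1."""
    y, x1 = demean(y), demean(x1)
    b1 = dot(x1, y) / dot(x1, x1)
    return [a - b1 * c for a, c in zip(y, x1)]

def resid_two(y, x1, x2):
    """Residuals from OLS of y on a constant, x1 and x2 (normal equations)."""
    y, x1, x2 = demean(y), demean(x1), demean(x2)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    det = s11 * s22 - s12 * s12
    b1 = (dot(x1, y) * s22 - dot(x2, y) * s12) / det
    b2 = (dot(x2, y) * s11 - dot(x1, y) * s12) / det
    return [a - b1 * c - b2 * d for a, c, d in zip(y, x1, x2)]

def df_tstat(u):
    """t-ratio of alpha in  du_t = alpha * u_{t-1} + eta_t."""
    lag = u[:-1]
    du = [b - a for a, b in zip(u[:-1], u[1:])]
    sll = dot(lag, lag)
    alpha = dot(lag, du) / sll
    ssr = sum((d - alpha * a) ** 2 for a, d in zip(lag, du))
    return alpha / (ssr / (len(du) - 1) / sll) ** 0.5

def random_walk(rng, T):
    path, level = [], 0.0
    for _ in range(T):
        level += rng.gauss(0, 1)
        path.append(level)
    return path

rng = random.Random(2)
T = 100
x1 = random_walk(rng, T)
x2 = random_walk(rng, T)
y = [a + b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]

t_multi = df_tstat(resid_two(y, x1, x2))   # full regression: residual is I(0)
t_bi = df_tstat(resid_one(y, x1))          # x2 omitted: residual contains x2

print("multivariate DF t-ratio:", t_multi, "(5% value -3.93, Table 14.2, n = 3)")
print("bivariate DF t-ratio:   ", t_bi, "(5% value -3.37, Table 14.2, n = 2)")
```

In the bivariate regression the residual inherits the random walk $x_2$, so failing to reject no cointegration there says nothing about whether the three variables together are cointegrated.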

The analogy with the omitted-variable case in regression analysis cannot be
pushed too far. In the case of simple versus multiple regression,
if $x_1$ and $x_2$ are uncorrelated, the coefficient of $x_1$ will be the same in both
the regressions (with and without $x_2$). In the case of cointegration, this condition
is not sufficient. When will the bivariate and multivariate tests of cointegration
give the same results? The answer is: when there is only one unit root
process driving all the variables. Suppose that we are considering four series
on exchange rates. Then if the matrix in the VAR representation (discussed
earlier) has only one unit root, the bivariate and multivariate tests will give the
same results.

Another issue that arises in estimating the cointegrating regressions is the 
choice of the dependent variable. In the two-variable case, whether we regress 
y on x or x on y (and take the reciprocal of the regression coefficient) does not 
make any difference asymptotically, but it does in small samples. The issue is 
the same in the case of more than two variables. However, this problem does 
not arise if one is using the maximum likelihood (ML) method, as in the Johansen
procedure (which we have only referred to, not discussed). The problem is
similar to that of 2SLS versus LIML discussed in Chapter 9. The 2SLS estimates,
which depend on the regression method, are not, in general, invariant to
the normalization adopted, whereas the LIML estimates, which depend on the 
ML method, are invariant. 
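The dependence on normalization is easy to see in a small sample. In the sketch below, which uses simulated data with hypothetical values (true slope 0.5, T = 30, seed 3), the slope from regressing y on x is compared with the reciprocal of the slope from regressing x on y. Their ratio equals the squared sample correlation, so the two estimates coincide only when the fit is perfect.

```python
import random

def slope(dep, reg):
    """OLS slope from regressing dep on a constant and reg."""
    n = len(dep)
    md, mr = sum(dep) / n, sum(reg) / n
    srr = sum((r - mr) ** 2 for r in reg)
    sdr = sum((r - mr) * (d - md) for d, r in zip(dep, reg))
    return sdr / srr

# Small simulated sample: x is a random walk, y = 0.5 x + noise.
rng = random.Random(3)
T = 30
x, level = [], 0.0
for _ in range(T):
    level += rng.gauss(0, 1)
    x.append(level)
y = [0.5 * xt + rng.gauss(0, 1) for xt in x]

b_direct = slope(y, x)        # y regressed on x
g = slope(x, y)               # x regressed on y
b_reverse = 1.0 / g           # implied coefficient of x after renormalizing
r_squared = b_direct * g      # product of the two slopes = squared correlation

print("y on x:            ", b_direct)
print("1 / (x on y):      ", b_reverse)
print("squared correlation:", r_squared)
```

As the sample grows and the squared correlation between cointegrated I(1) series approaches 1, the two normalizations give nearly the same answer, which is the asymptotic equivalence mentioned in the text.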

While estimating cointegrating regressions, many of the problems that we 
often discuss in the case of the usual regression and simultaneous equations 
models (e.g., omitted variables, parameter instability due to structural change, 
outliers, multicollinearity, heteroskedasticity, etc.) are often ignored, and the 
attention is concentrated on testing for cointegration, as if that were the ultimate
objective of all analysis. Even if one is doing an analysis with I(1) variables,
many of these problems should not be ignored, and they do affect tests
for cointegration as well. 

One final issue is that of the long-run equilibrium economic relationships that 
the cointegrating regressions are supposed to capture. The earlier literature on 
partial adjustment models (discussed in Sections 10.6 and 10.7) was concerned 
with the estimation of long-run equilibrium relationships as well as the time 
lags involved in achieving equilibrium. In discussions of cointegration, the 
long-run relationships are estimated through static regressions and not much is 
said regarding the time lags required to achieve the equilibrium unless an ECM 
is also estimated. As argued earlier, a procedure of estimating both the long-run
parameters and short-run dynamics jointly would be a better one and would
also be in the spirit of the earlier discussions on the estimation of dynamic
models. Also, given that the evidence in favor of unit root processes in most
economic time series has been found to be fragile, preoccupation with cointegration
as the sole vehicle for studying dynamic economic relationships is unwarranted.
Estimation of standard ECM and VAR models with attendant diagnostics
might lead to less fragile inference. The ECMs can be used to merge
short-run and long-run forecasts in a consistent fashion.


Summary 


1. Unrestricted VAR models suffer from the problem of overparametrization.
The Bayesian version (BVAR) has been found to give better results and
has a good forecasting record.

2. There have been many tests suggested for unit roots. In most of these
tests, the null hypothesis is that there is a unit root, and it is rejected only when
there is strong evidence against it. Using these tests, most economic time series
have been found to have unit roots. However, some tests have been devised
that use stationarity as the null and unit root as the alternative hypotheses.
Using these tests, most economic time series have been found not to have a
unit root. Some other limitations of unit root tests have also been discussed.
The evidence of unit roots in economic time series appears to be fragile.

3. The theory of cointegration tries to study the interrelationships between 
long-run movements in economic time series. Most economic theories are 
about long-run behavior, and thus much important information relevant for 
testing these theories is lost if the time series is detrended or differenced, as in 
the Box-Jenkins approach, before any analysis is done. 

4. Cointegration implies the existence of an error correction model (ECM). 
It also implies some restrictions on the VAR model. 

5. Tests for cointegration specify the null hypothesis as no cointegration. 
This is because the unit root tests have the null hypothesis of unit root. Some 
problems with this have been discussed. 

6. Cointegration theory has been used to test the rational expectations hypothesis
(REH) and the market efficiency hypothesis (MEH). The former rests
on rejecting the hypothesis of no cointegration and the latter on acceptance of
this hypothesis. The results of these tests are sensitive to whether we consider
bivariate or multivariate relationships. For instance, x and y may not be cointegrated,
but x, y, and z may be cointegrated.


Exercises 


1. Explain whether the following statements are true (T), false (F), or uncertain
(U). If a statement is not true in general but is true under some conditions,
state the conditions.

(a) An unrestricted VAR model gives bad forecasts because of the large 
number of lagged variables in each equation. 

(b) The overparametrization problem in the VAR model can be solved 
by using Almon or Koyck lags for the lagged coefficients. 

(c) The Bayesian VAR (BVAR) model introduces very unnatural restrictions
on the parameters in the model. Hence it cannot be expected
to give good forecasts.

(d) Unit root tests are all biased toward acceptance of the unit root null 
hypothesis. 

(e) It is better to have the null hypothesis of stationarity with the alternative
as a unit root rather than the other way around.

(f) The Dickey-Fuller tests are not very useful for testing unit roots.
One should use the Phillips tests.

(g) The Box-Jenkins method of visual inspection of the correlogram in
levels and first differences gives the same results as the Dickey-Fuller
unit root tests.

(h) If $x_t$ and $y_t$ both have unit roots, the coefficient of $x_{t-1}$ in the equation
for $x_t$ and the coefficient of $y_{t-1}$ in the equation for $y_t$ will both be
close to 1 when we estimate a VAR model for these two variables.

(i) Unit root tests all have low power because the alternative is not well 
specified. 

(j) Unit root tests applied to macroeconomic time series have low 
power because of changes in the structure of the economy. 

(k) When considering simultaneous equations models in I(1) variables,
we do not have to worry about simultaneity bias. We can estimate
the equations by OLS and get consistent estimates of the parameters.

(l) If $x_t$ and $y_t$ are both I(1) variables, since estimates of the regression
coefficient are superconsistent, it really does not matter whether we
regress $x_t$ on $y_t$ or $y_t$ on $x_t$.

(m) Cointegration implies Granger causality. 

(n) Suppose that we have the following pth-order representation of a
vector $x_t$ of random variables:

$$x_t = A_1 x_{t-1} + A_2 x_{t-2} + \cdots + A_p x_{t-p} + e_t$$

The rank of $A_p$ gives the number of cointegrating relationships.

(o) The representation in part (n) can be written as

$$\Delta x_t = B_1 \Delta x_{t-1} + \cdots + B_{p-1} \Delta x_{t-p+1} + B_p x_{t-p} + e_t$$

The rank of $B_p$ gives the number of cointegrating relationships.

2. Explain how to construct hypotheses to test the following economic theories
by cointegration tests.

(a) Market efficiency hypothesis.

(b) Purchasing power parity theory. 

(c) Rational expectations hypothesis. 

3. Consider the following hypotheses:

$H_0$: $\alpha = 1$ in $y_t = \mu + \alpha y_{t-1} + u_t$ ($\mu \neq 0$)

$H_1$: $|\alpha| < 1$ in $Z_t = \mu + \alpha Z_{t-1} + u_t$, where $Z_t = y_t - \beta_0 - \beta_1 t$

Show that the OLS estimator $\hat{\alpha}$ of $\alpha$ in $y_t = \mu + \alpha y_{t-1} + u_t$ tends to 1
under $H_1$.

Exercises 4 to 10 are similar except that they use different data sets. 
Students should select the data set and the appropriate question. 

4. Using the data in Table 13.4, estimate a VAR model for $C_t$ and $Y_t$ with one
lag and two lags.

(a) Is the model with two lags better than the model with one lag? Use
the AIC and BIC criteria (see Section 13.5). Also check for residual
autocorrelations.

(b) Since the data are quarterly, regress the data on seasonal dummies,
and compute the residuals. Repeat the analysis with these residuals
(assuming that they are the observations).

5. Consider the VAR models with one and two lags in Exercise 4. 

(a) Estimate the characteristic roots and vectors for the relevant matrices
discussed in Section 14.8. Apply tests for unit roots and tests
for cointegration.

(b) If there is a cointegrating regression, estimate it from the characteristic
vectors and also from the static regression as suggested by
Granger and Engle.

(c) Are the results different for the VAR models with one and two lags? 
Are they different from those from the static regressions? What do 
you conclude from these results? 

(d) Repeat parts (a) to (c) with the seasonally adjusted data (residuals 
from the regression on seasonal dummies). 

6. Repeat Exercises 4 and 5 using the data in Table 13.5. 

7. Repeat Exercises 4 and 5 using the data in Table 13.7. This time, when
adjusting for seasonality, you have to regress the original series on a constant
and 11 monthly dummies.

8. For the data in Table 4.10: 

(a) Estimate a regression of HS on y and RR. 

(b) Estimate a VAR model with one lag. Compute the characteristic 
roots. Test for cointegration and estimate the cointegrating vectors, 
if any. 

(c) What sense can you make of the multiple regression estimated in 
part (a)? 

(d) Repeat the analysis with residuals from a regression of the raw data 
on quarterly dummies. 

9. In the data set in Table 13.8, examine which of the time series are trend 
stationary and which are difference stationary. Also investigate whether 
there are any cointegrating relationships among the series. 

10. Consider the data in Table 13.7. Take yearly data for each month (e.g., Mar 
52, Mar 53, Mar 54, ... , Mar 82; similarly, Nov 52, Nov 53, ... , Nov 
82). We thus have 12 annual time series. 

(a) For each of these series, apply unit root tests. 

(b) For each of these series, compute a VAR model with one lag. Apply 
unit root tests and check for any cointegrating regressions. 

(c) Compare the results of this analysis with those in Exercise 7.


Table 14.1 Critical Values for Unit Root Tests

Sample        K-Test            t-Test           F-Test(a)
Size        1%      5%        1%      5%        5%      1%

AR(1)
  25      -11.9    -7.3     -2.66   -1.95
  50      -12.9    -7.7     -2.62   -1.95
 100      -13.3    -7.9     -2.60   -1.95
 250      -13.6    -8.0     -2.58   -1.95
 500      -13.7    -8.0     -2.58   -1.95
  ∞       -13.8    -8.1     -2.58   -1.95

AR(1) with constant
  25      -17.2   -12.5     -3.75   -3.00
  50      -18.9   -13.3     -3.58   -2.93
 100      -19.8   -13.7     -3.51   -2.89
 250      -20.3   -14.0     -3.46   -2.88
 500      -20.5   -14.0     -3.44   -2.87
  ∞       -20.7   -14.1     -3.43   -2.86

AR(1) with constant and trend
  25      -22.5   -17.9     -4.38   -3.60      7.24   10.61
  50      -25.7   -19.8     -4.15   -3.50      6.73    9.31
 100      -27.4   -20.7     -4.04   -3.45      6.49    8.73
 250      -28.4   -21.3     -3.99   -3.43      6.34    8.43
 500      -28.9   -21.5     -3.98   -3.42      6.30    8.34
  ∞       -29.5   -21.8     -3.96   -3.41      6.25    8.27

(a) The K-test is based on the sample size times $(\hat{\rho} - 1)$; $t = (\hat{\rho} - 1)/SE(\hat{\rho})$; the F-test
is for $\gamma = 0$ and $\rho = 1$ in $y_t = \alpha + \gamma t + \rho y_{t-1} + u_t$.

Source: W. A. Fuller, Introduction to Statistical Time Series (New York: Wiley, 1976), p. 371
for the K-test and p. 373 for the t-test; D. A. Dickey and W. A. Fuller, “Likelihood Ratio
Statistics for Autoregressive Time Series with a Unit Root,” Econometrica, Vol. 49, No. 4,
1981, p. 1063 for the F-test.

Table 14.2 Critical Values (5%) for the Cointegration Tests

n      T      CRDW      DF      ADF(a)
2      50     0.78    -3.67    -3.29
      100     0.39    -3.37    -3.17
      200     0.20    -3.37    -3.25
3      50     0.99    -4.11    -3.75
      100     0.55    -3.93    -3.62
      200     0.39    -3.78    -3.78
4      50     1.10    -4.35    -3.98
      100     0.65    -4.22    -4.02
      200     0.48    -4.18    -4.13
5      50     1.28    -4.76    -4.15
      100     0.76    -4.58    -4.36
      200     0.57    -4.48    -4.43

(a) CRDW $= \sum (u_t - u_{t-1})^2 / \sum u_t^2$. CRDW means “cointegrating regression
Durbin-Watson” statistic; DF = t-test for $\alpha = 0$ in $\Delta u_t = \alpha u_{t-1} + \eta_t$; ADF = t-test for
$\alpha = 0$ in $\Delta u_t = \alpha u_{t-1} + \sum_{i=1}^{p} \phi_i \Delta u_{t-i} + \eta_t$. In all these tests
$u_t$ is the residual from the cointegrating regression.

Source: R. F. Engle and B. S. Yoo, “Forecasting and Testing
in Cointegrated Systems,” Journal of Econometrics, Vol.
35, 1987.
A Tables


Table A.4 Cumulative Student’s t-Distribution

$$F(t) = \int_{-\infty}^{t} \frac{\Gamma((n+1)/2)}{\Gamma(n/2)\sqrt{\pi n}} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} dx$$

  n     .75     .90     .95     .975     .99     .995     .9995
  1    1.000   3.078   6.314   12.706   31.821   63.657   636.619
  2     .816   1.886   2.920    4.303    6.965    9.925    31.598
  3     .765   1.638   2.353    3.182    4.541    5.841    12.941
  4     .741   1.533   2.132    2.776    3.747    4.604     8.610
  5     .727   1.476   2.015    2.571    3.365    4.032     6.859
  6     .718   1.440   1.943    2.447    3.143    3.707     5.959
  7     .711   1.415   1.895    2.365    2.998    3.499     5.405
  8     .706   1.397   1.860    2.306    2.896    3.355     5.041
  9     .703   1.383   1.833    2.262    2.821    3.250     4.781
 10     .700   1.372   1.812    2.228    2.764    3.169     4.587
 11     .697   1.363   1.796    2.201    2.718    3.106     4.437
 12     .695   1.356   1.782    2.179    2.681    3.055     4.318
 13     .694   1.350   1.771    2.160    2.650    3.012     4.221
 14     .692   1.345   1.761    2.145    2.624    2.977     4.140
 15     .691   1.341   1.753    2.131    2.602    2.947     4.073
 16     .690   1.337   1.746    2.120    2.583    2.921     4.015
 17     .689   1.333   1.740    2.110    2.567    2.898     3.965
 18     .688   1.330   1.734    2.101    2.552    2.878     3.922
 19     .688   1.328   1.729    2.093    2.539    2.861     3.883
 20     .687   1.325   1.725    2.086    2.528    2.845     3.850
 21     .686   1.323   1.721    2.080    2.518    2.831     3.819
 22     .686   1.321   1.717    2.074    2.508    2.819     3.792
 23     .685   1.319   1.714    2.069    2.500    2.807     3.767
 24     .685   1.318   1.711    2.064    2.492    2.797     3.745
 25     .684   1.316   1.708    2.060    2.485    2.787     3.725
 26     .684   1.315   1.706    2.056    2.479    2.779     3.707
 27     .684   1.314   1.703    2.052    2.473    2.771     3.690
 28     .683   1.313   1.701    2.048    2.467    2.763     3.674
 29     .683   1.311   1.699    2.045    2.462    2.756     3.659
 30     .683   1.310   1.697    2.042    2.457    2.750     3.646
 40     .681   1.303   1.684    2.021    2.423    2.704     3.551
 60     .679   1.296   1.671    2.000    2.390    2.660     3.460
120     .677   1.289   1.658    1.980    2.358    2.617     3.373
  ∞     .674   1.282   1.645    1.960    2.326    2.576     3.291

Source: Abridged from R. A. Fisher and Frank Yates, Statistical Tables, Oliver & Boyd, 
Edinburgh and London, 1938. It is here published with the kind permission of the authors and 
their publishers. 



Table A.5 Critical Values for the Durbin-Watson Test: 5% Significance Level

multiple equations, 469
proxy variables, 464-466
interval estimation, 27-28
method of moments, 66
asymptotic unbiasedness, 25
Exogenous variables, 357
adaptive expectations, 408-410
model-consistent expectations, 433
sufficient expectations, 433
Goodness of fit, 131, 332-334, 369, 540-541, 550-552
Granger causality, 393
consequences, 209-211
Glejser’s test, 204
maximum likelihood method, 213
Hypothesis testing
general theory, 27-32
32-33

Idempotent matrices, 55
363-364
Working’s concept, 385
classical, 22-26
multiple regression, 134-135
simple regression, 76-78


Instrumental variables, 112, 367-369, 
461-462 

Inverse of a matrix, 48 
Inverse prediction, 101-102 
Irrelevant variables, 164-165
Koyck model
estimation in autoregressive form, 411
Lagrangian multiplier (LM) test, 119-124,
Likelihood ratio (LR) test, 119-124, 206
Limited information maximum likelihood
Methods of estimation
limited information maximum
maximum likelihood, 118
Moving average process, 531
definition, 270
dropping variables, 289-291
extraneous estimates, 293
problems in measuring, 276
ridge regression, 281-283
Theil’s measure, 275
Multiple regression
analysis of variance, 157
statistical inference, 134-135
test for linear functions of parameters,
Nonnested hypothesis tests, 514-518
Normal distribution, 19
Omitted variables
interpretation of Hausman’s test, 510-
test for, 477
Orthogonal matrices, 49
illustrated, 89-95

Partial adjustment models, 419-423 
PE test, 223 

Piecewise regression, 185 
Polynomial lags, 423-427 


Positive and negative definite matrices,
Posterior odds for model selection, 503-
Principal component regression, 284-285

Probability, 12-14 

Probability distributions, 17-21 

Probit model, 327-329 

Proxy variables, 464-466 

PSW test, 513-514 

Q-statistics, 540-541
Qualitative variables (see Dummy 
variables) 

Ramsey’s RESET test, 478 
Rao’s score test, 119-124 
Rational expectations 
defined, 431-433

estimation of a demand and supply 
model, 436-442 
serial correlation problem, 443 
tests for rationality, 434-435, 599 
Rational lags, 429 
Recursive systems, 387 
Reduced form, 358-360 
Regression 

analysis of variance, 84, 156-157 
assumptions and specification, 65 
causation in regression, 75-76 
censored regression model, 339-342 
cointegrating regression, 590-591 
interpretation, 143-145 
inverse prediction, 101-102 
irrelevant variables, 164-165 
method of least squares, 69-71 
method of moments, 66 
outliers, 89-95 

prediction, 85, 154-155, 319-320 
statistical inference, 76-78, 134-135 
tests of hypotheses, 80 
truncated regression model, 342-343 
Regression fallacy, 105 
Regression with no constant term, 83 
Residuals 

BLUS residuals, 483 
predicted residuals, 481 
problems with least squares residuals,
recursive residuals, 483
studentized residuals, 482 
Reverse regression, 71-72, 459-461 
Ridge regression, 281-283 
R-square 

in dummy dependent variable models, 
332-334 

in multiple regression models, 131 
relationship with F-ratios, 167-168 
in simultaneous equation models, 369 
in time-series models, 550 

Sampling distributions 
multiple regression, 134-135 
from normal populations, 26 
simple regression, 76-78 
Sargan’s test, 255
Selection of regressors, 496-502 
Serial correlation 
in autoregressive models, 248 
LM test, 250 

in rational expectations models, 443-444

in unit root tests, 583 
Significance levels 
criticism, 32 
defined, 29 
Sims’ test, 394 
Specification errors 
Hausman’s test, 506-509
irrelevant variables, 164-165 
omitted variables, 161-162 
Ramsey’s test, 478
Sargan’s test in dynamic models, 255 
Specification searches, 491 
Spurious trends, 260-261 
Stationary time-series, 527-530 
Stochastic regressors, 126 
Structural change and unit roots, 587 

t-distribution, 21
Tests of hypotheses 
for cointegration, 598 
general theory, 29-32 
for linear functions of parameters, 159 
in linear regression models, 80, 158-
for market efficiency, 599-600
nonnested hypotheses, 514-518 
for rational expectations hypothesis, 
599 

for unit roots, 583-587 
Theil’s R² criterion, 497
Time series 

ARIMA models, 538-541 
autoregressive, 533 
Box-Jenkins methods, 542-549 
forecasting, 544 
moving average, 531 
R² measures, 550-552
stationarity, 527-530 
Trace of a matrix, 55 
Trends and random walks, 258-261 
Truncated variables 
criticism of the tobit model, 341 
tobit model, 338-342 
truncated regression model, 342-343 
TSP model, 259 
Two-stage least squares, 373 

Unbiasedness 
defined, 23 

least squares estimators, 114 
under stochastic regressors, 126 
Underestimation of standard errors 
under autocorrelation, 241-243 
under heteroskedasticity, 210-211 
Unit root tests 
Dickey-Fuller test, 583 
powers of unit root tests, 584 
serial correlation, 583 
structural change, 587 
Unit roots, 581-582 

Variance inflation factor (VIF), 274 
Vector autoregressions, 578-580 
Vector autoregressions and cointegration,
Wald test, 121-124, 256
Weighted least squares, 212-213 
White’s test, 204 

Working’s concept of identification, 385 



